System and Method for Code and Data Versioning in Computerized Data Modeling and Analysis

ABSTRACT

Code and data versioning allow developers to work on code and data without affecting production code and data and without affecting the development activities of other developers. Code and data being worked on by a developer are associated with a task. The system automatically determines the dataset to use for a given development task from among a production dataset, a latest dataset, or a temporary dataset associated with the development task so that development code does not have to be modified to read from a specific dataset.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of, and therefore claims priority from, U.S. patent application Ser. No. 15/629,342 entitled SYSTEM AND METHOD FOR CODE AND DATA VERSIONING IN COMPUTERIZED DATA MODELING AND ANALYSIS filed on Jun. 21, 2017 (issuing as U.S. Pat. No. 11,175,910 on Nov. 16, 2021), which is a continuation-in-part of, and therefore claims priority from, U.S. patent application Ser. No. 15/388,388 entitled SYSTEM AND METHOD FOR RAPID DEVELOPMENT AND DEPLOYMENT OF REUSABLE ANALYTIC CODE FOR USE IN COMPUTERIZED DATA MODELING AND ANALYSIS filed on Dec. 22, 2016 (now U.S. Pat. No. 10,394,532 issued Aug. 27, 2019), which claims the benefit of U.S. Provisional Application No. 62/271,041 filed on Dec. 22, 2015; each of which patent applications is hereby incorporated herein by reference in its entirety.

This application also may be related to one or more of the following commonly-owned patent applications filed on Jun. 21, 2017, each of which is hereby incorporated herein by reference in its entirety:

U.S. patent application Ser. No. 15/629,316 entitled SYSTEM AND METHOD FOR INTERACTIVE REPORTING IN COMPUTERIZED DATA MODELING AND ANALYSIS (U.S. Pat. No. 10,275,502 issued Apr. 30, 2019); and

U.S. patent application Ser. No. 15/629,328 entitled SYSTEM AND METHOD FOR OPTIMIZED QUERY EXECUTION IN COMPUTERIZED DATA MODELING AND ANALYSIS (U.S. Pat. No. 10,268,753 issued Apr. 23, 2019).

FIELD OF THE DISCLOSURE

The present disclosure relates generally to computer-based tools for developing and deploying analytic computer code. More specifically, the present disclosure relates to a system and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis.

BACKGROUND

In today's information technology world, there is an increased interest in processing “big” data to develop insights (e.g., better analytical insight, better customer understanding, etc.) and business advantages (e.g., in enterprise analytics, data management processes, etc.). Customers leave an audit trail or digital log of the interactions, purchases, inquiries, and preferences through online interactions with an organization. Discovering and interpreting audit trails within big data provides a significant advantage to companies looking to realize greater value from the data they capture and manage every day. Structured, semi-structured, and unstructured data points are being generated and captured at an ever-increasing pace, thereby forming big data, which is typically defined in terms of velocity, variety, and volume. Big data is fast-flowing, ever-growing, heterogeneous, and has exceedingly noisy input, and as a result transforming data into signals is critical. As more companies (e.g., airlines, telecommunications companies, financial institutions, etc.) focus on real-world use cases, the demand for continually refreshed signals will continue to increase.

Due to the depth and breadth of available data, data science (and data scientists) is required to transform complex data into simple digestible formats for quick interpretation and understanding. Thus, data science, and in particular, the field of data analytics, focuses on transforming big data into business value (e.g., helping companies anticipate customer behaviors and responses). The current analytic approach to capitalize on big data starts with raw data and ends with intelligence, which is then used to solve a particular business need so that data is ultimately translated into value.

However, a data scientist tasked with a well-defined problem (e.g., rank customers by probability of attrition in the next 90 days) is required to expend a significant amount of effort on tedious manual processes (e.g., aggregating, analyzing, cleansing, preparing, and transforming raw data) in order to begin conducting analytics. In such an approach, significant effort is spent on data preparation (e.g., cleaning, linking, processing), and less is spent on analytics (e.g., business intelligence, visualization, machine learning, model building).

Further, usually the intelligence gathered from the data is not shared across the enterprise (e.g., across use cases, business units, etc.) and is specific to solving a particular use case or business scenario. In this approach, whenever a new use case is presented, an entirely new analytics solution needs to be developed, such that there is no reuse of intelligence across different use cases. Each piece of intelligence that is derived from the data is developed from scratch for each use case that requires it, which often means that it's being recreated multiple times for the same enterprise. There are no natural economies of scale in the process, and there are not enough data scientists to tackle the growing number of business opportunities while relying on such techniques. This can result in inefficiencies and waste, including lengthy use case execution and missed business opportunities.

Currently, to conduct analytics on “big” data, data scientists are often required to develop large quantities of software code. Often, such code is expensive to develop, is highly customized, and is not easily adopted for other uses in the analytics field. Minimizing redundant costs and shortening development cycles requires significantly reducing the amount of time that data scientists spend managing and coordinating raw data. Further, optimizing this work can allow data scientists to improve their effectiveness by honing signals and ultimately improving the foundation that drives faster results and business responsiveness. Thus, there is a need for a system for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis.

SUMMARY

The present disclosure relates to a system and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis. The system includes a centralized, continually updated environment to capture pre-processing steps used in analyzing big data, such that the complex transformations and calculations become continually fresh and accessible to those investigating business opportunities. This centralized, continually refreshed system provides a data-centric competitive advantage for users (e.g., to serve customers better, reduce costs, etc.), as it provides the foresight to anticipate future problems and reuses development efforts. The system incorporates deep domain expertise as well as ongoing expertise in data science, big data architecture, and data management processes. In particular, the system allows for rapid development and deployment of analytic code that can easily be re-used in various data analytics applications, and on multiple computer systems.

Benefits of the system include a faster time to value, as data scientists can now assemble pre-existing ETL (extract, transform, and load) processes as well as signal generation components to tackle new use cases more quickly. The present disclosure is a technological solution for coding and developing software to extract information for “big data” problems. The system design allows for increased modularity by integrating with various other platforms seamlessly. The system design also incorporates a new technological solution for creating “signals,” which allows a user to extract information from “big data” by focusing on high-level issues in obtaining the data the user desires, without having to focus on the low-level minutiae of coding big data software as was required by previous systems. The present disclosure allows for reduced software development complexity, a quicker software development lifecycle, and reusability of software code.

In accordance with one embodiment of the invention, a computer-implemented method, system, and computer program product are provided for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files. The computer-implemented method, system, and computer program product perform processes including maintaining a storage system for storing the data files and code files; receiving a request by a given user to modify a given production code file; establishing a task for the user; placing a lock on the given production code file; storing a modified version of the given production code file in a logical partition of the storage system associated with the task; applying the modified version of the given production code file to a specified data file to create a modified version of the specified data file; assigning a first unique version identifier for the modified version of the specified data file; and storing the modified version of the specified data file in the logical partition of the storage system associated with the task in a manner accessible using the first unique version identifier, such that the modified code file is isolated from the production code files and code files of other users, and the modified data file is isolated from the production data files and data files of other users.

In various alternative embodiments, the specified data file may be a production data file or a modified version of a production data file. The logical partition may be a folder. The storage associated with the task may include an append-only file system, in which case the modified version of the specified data file includes append-only files representing changes relative to the specified data file.

Additionally or alternatively, the processes may further include committing the modified version of the given production code file such that the modified version of the given production code file is designated as the latest version of the given production code file; and committing the modified version of the specified data file such that the modified version of the specified data file is designated as the latest version of the specified data file among the set of production data files. Data label features and a plurality of configuration files may be used to allow the user to publish and use the latest version of analytic code. The user's workspace may be isolated from previous versions of analytic code so that the user does not encounter interruptions from new versions of the analytic code.
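
By way of illustration only, the task, lock, partition, and commit processes described above might be sketched as follows. This is a minimal in-memory sketch, not the disclosed implementation: the class name TaskWorkspace, its method names, the representation of a code file as a Python callable, and the use of UUIDs as version identifiers are all assumptions introduced for this example.

```python
import uuid


class TaskWorkspace:
    """Hypothetical in-memory stand-in for the storage system (illustrative only)."""

    def __init__(self, production_code, production_data):
        self.production_code = dict(production_code)  # code file name -> callable
        self.production_data = dict(production_data)  # data file name -> list of rows
        self.locks = {}       # code file name -> task id holding the lock
        self.partitions = {}  # task id -> {"code": {...}, "data": {...}}

    def open_task(self, user, code_file):
        """Establish a task for the user and lock the production code file."""
        if code_file in self.locks:
            raise RuntimeError(f"{code_file} is locked by task {self.locks[code_file]}")
        task_id = f"{user}-{uuid.uuid4().hex[:8]}"
        self.locks[code_file] = task_id
        self.partitions[task_id] = {"code": {}, "data": {}}
        return task_id

    def save_code(self, task_id, code_file, transform):
        """Store the modified code in the task's partition, not in production."""
        self.partitions[task_id]["code"][code_file] = transform

    def run(self, task_id, code_file, data_file):
        """Apply the task's code to a data file and store the result under a new version id."""
        transform = self.partitions[task_id]["code"][code_file]
        modified = [transform(row) for row in self.production_data[data_file]]
        version_id = uuid.uuid4().hex  # assumed form of the unique version identifier
        self.partitions[task_id]["data"][version_id] = (data_file, modified)
        return version_id

    def commit(self, task_id, code_file, version_id):
        """Designate the task's code and data as the latest versions and release the lock."""
        partition = self.partitions.pop(task_id)
        data_file, rows = partition["data"][version_id]
        self.production_code[code_file] = partition["code"][code_file]
        self.production_data[data_file] = rows
        del self.locks[code_file]
```

In this sketch, nothing outside the task's partition changes until commit is called, which mirrors the isolation of the modified code and data files from the production files and from the files of other users.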

In accordance with another embodiment of the invention, a computer-implemented method, system, and computer program product are provided for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files. The computer-implemented method, system, and computer program product perform processes including maintaining a first view having a production version of a dataset; creating a task for a developer, the task being associated with a second view; associating a first code file with the task for the first view, the first code file including code that modifies the dataset; creating a temporary version of the dataset in the first view; associating the temporary version of the dataset with the task; associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task.

In various alternative embodiments, locks may be placed on the first code file in the first view and the second code file in the second view. Placing locks on the first and second code files may involve checking out the first and second code files from a source control system. The processes may further include receiving, from the developer, a request to commit changes made to the first and second code files; checking the first and second code files into a source control system; changing the temporary dataset to be a latest dataset; and terminating the task.
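
As a hedged illustration of how a read instruction that names no version might be resolved, the following sketch selects among a temporary, a latest, and a production dataset. The DatasetCatalog class, its field and method names, and the fallback order (temporary, then latest, then production) are assumptions made for this example rather than details taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetCatalog:
    """Hypothetical catalog of dataset versions (illustrative only)."""
    production: dict = field(default_factory=dict)  # name -> data
    latest: dict = field(default_factory=dict)      # name -> data
    temporary: dict = field(default_factory=dict)   # (task_id, name) -> data

    def resolve(self, name: str, task_id: Optional[str] = None,
                development: bool = False):
        """Return (label, data) for a read that does not identify a version."""
        # A task reads its own temporary dataset when one is associated with it.
        if task_id is not None and (task_id, name) in self.temporary:
            return "temporary", self.temporary[(task_id, name)]
        # Other development reads fall back to the latest dataset (assumed order).
        if (development or task_id is not None) and name in self.latest:
            return "latest", self.latest[name]
        # Production processes read the production dataset.
        return "production", self.production[name]

    def commit_task(self, task_id: str) -> None:
        """Change the task's temporary datasets to be the latest datasets."""
        for key in [k for k in self.temporary if k[0] == task_id]:
            self.latest[key[1]] = self.temporary.pop(key)
```

Because the caller passes only a dataset name and, at most, a task identifier, the same code file could run unchanged in a production view, a development view, or a task view.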

Additional embodiments may be disclosed and claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing features of the disclosure will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:

FIG. 1 is a diagram illustrating hardware and software components of the system;

FIG. 2 is a diagram of a traditional data signal architecture;

FIG. 3 is a diagram of a new data signal architecture provided by the system;

FIGS. 4A-4C are diagrams illustrating the system in greater detail;

FIG. 5 is a screenshot illustrating an integrated development environment generated by the system;

FIG. 6 is a diagram illustrating signal library and potential use cases of the system;

FIG. 7 is a diagram illustrating analytic model development and deployment carried out by the system;

FIG. 8 is a diagram illustrating hardware and software components of the system in one implementation;

FIGS. 9-10 are diagrams illustrating hardware and software components of the system during development and production;

FIG. 11 is a screenshot illustrating data profiles for each column using the integrated development environment generated by the system;

FIG. 12 is a screenshot illustrating profiling of raw data using the integrated development environment generated by the system;

FIG. 13 is a screenshot illustrating displaying of specific entries within raw data using the integrated development environment generated by the system;

FIG. 14 is a screenshot illustrating aggregating and cleaning of raw data using the integrated development environment generated by the system;

FIG. 15 is a screenshot illustrating managing and confirmation of raw data quality using the integrated development environment generated by the system;

FIG. 16 is a screenshot illustrating auto-generated visualization of a data model created using the integrated development environment;

FIG. 17A is a screenshot illustrating creation of reusable analytic code using the Workbench 500 generated by the system;

FIG. 17B is a screenshot illustrating the graphical user interface generated by the Signal Builder component of the Workbench of the system;

FIG. 18 is a screenshot illustrating a user interface screen generated by the system for visualizing signal paths using the Knowledge Center generated by the system;

FIG. 19 is a screenshot illustrating a user interface screen generated by the system for visualizing a particular signal using the Knowledge Center generated by the system;

FIG. 20A is a screenshot illustrating a user interface screen generated by the system for finding a signal using the Knowledge Center generated by the system;

FIG. 20B is a screenshot illustrating a user interface screen generated by the system for finding a signal using the Knowledge Center 600 generated by the system;

FIGS. 21A-F are screenshots illustrating user interface screens generated by the system for selecting entries with particular signal values using the Knowledge Center generated by the system;

FIG. 22 is a screenshot illustrating a user interface screen generated by the system for visualizing signal parts of a signal using the Knowledge Center generated by the system;

FIG. 23A is a screenshot illustrating a user interface screen generated by the system for visualizing a lineage of a signal using the Knowledge Center generated by the system;

FIG. 23B is a screenshot illustrating a user interface screen generated by the system for displaying signal values, statistics and visualization of signal value distribution;

FIG. 24A is a screenshot illustrating preparation of data to train a model using the integrated development environment generated by the system;

FIG. 24B is a screenshot illustrating a graphical user interface generated by the system for allowing users to select from a variety of model algorithms (e.g., logistic regression, deep autoencoder, etc.);

FIG. 24C is a screenshot illustrating the different parameter experiments users can apply during the model training process;

FIGS. 24D-J are screenshots illustrating the model training process in greater detail;

FIG. 25A is a screenshot illustrating training of a model using the Workbench subsystem of the present disclosure;

FIG. 25B is a screenshot illustrating preparation of data to train a model using the Workbench subsystem of the present disclosure;

FIG. 25C is a screenshot illustrating different data splitting options provided by the Workbench subsystem of the present disclosure;

FIG. 26 is another screenshot illustrating loading an external model trained outside of the integrated development environment;

FIG. 27 is a screenshot illustrating scoring a model using the integrated development environment generated by the system;

FIG. 28 is a screenshot illustrating monitoring model performance using the integrated development environment generated by the system;

FIG. 29A is a screenshot illustrating a solution dependency diagram of the integrated development environment generated by the system;

FIG. 29B is a screenshot illustrating a collaborative analytic solution development using the Workbench subsystem of the present disclosure;

FIGS. 29C-29J are screenshots illustrating environment files for enhancing collaboration;

FIGS. 30A-32 are screenshots illustrating the Signal Hub manager generated by the system; and

FIG. 33 is a diagram showing hardware and software components of the system.

FIG. 34 is a screenshot illustrating a sample interactive reporting user interface screen generated by Signal Hub platform, in accordance with an exemplary embodiment.

FIG. 35 is a screenshot illustrating a sample interactive reporting user interface screen generated by Signal Hub platform, in accordance with an exemplary embodiment.

FIG. 36 shows a representation of the screenshot of FIG. 34 showing an example of the results of a changed signal value.

FIG. 37 shows a representation of the screenshot of FIG. 35 based on the changes reflected in FIG. 36.

FIG. 38 is a flowchart of a computer-implemented method for interactive database reporting, in accordance with one exemplary embodiment.

FIG. 39 is a schematic diagram showing a functional dependency graph for the above example, in accordance with one exemplary embodiment.

FIG. 40 is a flowchart for query execution optimization in accordance with one exemplary embodiment.

FIG. 41 is a schematic diagram showing a dependency graph representing how datasets can be derived from other datasets.

FIG. 42 is a schematic diagram showing how a production process in one view automatically reads from a “no version” production dataset, in accordance with one exemplary embodiment.

FIG. 43 is a schematic diagram showing how a development process in one view automatically reads from a “latest” dataset, in accordance with one exemplary embodiment.

FIG. 44 is a schematic diagram showing how a development process in a task view automatically reads from a corresponding temporary dataset associated with the task, in accordance with one exemplary embodiment.

FIG. 45 is a schematic diagram showing an example of file updates using an append-only file system, in accordance with one exemplary embodiment.

FIG. 46 is a schematic diagram illustrating an example signal creation layer methodology in accordance with one exemplary embodiment.

DETAILED DESCRIPTION

Disclosed herein is a system and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis, as discussed in detail below in connection with FIGS. 1-46.

As used herein, the terms “signal” and “signals” refer to the data elements, patterns, and calculations that have, through scientific experimentation, been proven valuable in predicting a particular outcome. Signals can be generated by the system using analytic code that can be rapidly developed, deployed, and reused. Signals carry useful information about behaviors, events, customers, systems, interactions, and attributes, and can be used to predict future outcomes. In effect, signals capture underlying drivers and patterns to create useful, accurate inputs that are capable of being processed by a machine into algorithms. High-quality signals are necessary to distill the relationships among all the entities surrounding a problem and across all the attributes (including their time dimension) associated with these entities. For many problems, high-quality signals are as important in generating an accurate prediction as the underlying machine-learning algorithm that acts upon these signals in creating the prescriptive action.

The system of the present disclosure is referred to herein as “Signal Hub.” Signal Hub enables transforming data into intelligence as analytic code and then maintaining the intelligence as signals in a computer-based production environment that allows an entire organization to access and exploit the signals for value creation. In a given domain, many signals can be similar and reusable across different use cases and models. This signal-based approach enables data scientists to “write once and reuse everywhere,” as opposed to the traditional approach of “write once and reuse never.” The system provides signals (and the accompanying analytic code) in the fastest, most cost-effective method available, thereby accelerating the development of data science applications and lowering the cost of internal development cycles. Signal Hub allows ongoing data management tasks to be performed by systems engineers, shifting more mundane tasks away from scarce data scientists.

Signal Hub integrates data from a variety of sources, which enables the process of signal creation and utilization by business users and systems. Signal Hub provides a layer of maintained and refreshed intelligence (e.g., Signals) on top of the raw data that serves as a repository for scientists (e.g., data scientists) and developers (e.g., application developers) to execute analytics. This spares users from having to go back to the raw data for each new use case; they can instead benefit from existing signals stored in Signal Hub. Signal Hub continually extracts, stores, refreshes, and delivers the signals needed for specific applications, such that application developers and data scientists can work directly with signals rather than raw data. As the number of signals grows, the model development time shrinks. In this “bow tie” architecture, model developers concentrate on creating the best predictive models with expedited time to value for analytics. Signal Hub is highly scalable in terms of processing large amounts of data as well as supporting the implementation of a myriad of use cases. Signal Hub could be enterprise-grade, which means that in addition to supporting industry-standard scalability and security features, it is easy to integrate with existing systems and workflows. Signal Hub can also have a data flow engine that is flexible to allow processing of different computing environments, languages, and frameworks. A multi-target system data flow compiler can generate code to deploy on different target data flow engines utilizing different computer environments, languages, and frameworks. For applications with hard return on investment (ROI) metrics (e.g., churn reduction), faster time to value can equate to millions of dollars earned. Additionally, the system could lower development costs as data science project timelines potentially shrink, such as from 1 year to 3 months (e.g., a 75% improvement). Shorter development cycles and lower development costs could result in increased accessibility of data science to more parts of the business. Further, the system could reduce the total costs of ownership (TCO) for big data analytics.

FIG. 1 is a diagram illustrating hardware and software components of the system. The system 10 includes a computer system 12 (e.g., a server) having a database 14 stored therein and a Signal Hub engine 16. The computer system 12 could be any suitable computer server or cluster of servers (e.g., a server with an INTEL microprocessor, multiple processors, multiple processing cores, etc.) running any suitable operating system (e.g., Windows by Microsoft, Linux, Hadoop, etc.). The database 14 could be stored on the computer system 12, or located externally therefrom (e.g., in a separate database server in communication with the system 10).

The system 10 could be web-based and remotely accessible such that the system 10 communicates through a network 20 with one or more of a variety of computer systems 22 (e.g., personal computer system 26a, a smart cellular telephone 26b, a tablet computer 26c, or other devices). Network communication could be over the Internet using standard TCP/IP communications protocols (e.g., hypertext transfer protocol (HTTP), secure HTTP (HTTPS), file transfer protocol (FTP), electronic data interchange (EDI), etc.), through a private network connection (e.g., wide-area network (WAN) connection, emails, electronic data interchange (EDI) messages, extensible markup language (XML) messages, file transfer protocol (FTP) file transfers, etc.), or any other suitable wired or wireless electronic communications format. Further, the system 10 could be in communication through a network 20 with one or more third party servers 28. These servers 28 could be disparate “compute” servers on which analytics could be performed (e.g., Hadoop, etc.). The Hadoop system can manage resources (e.g., split workload and/or automatically optimize how and where computation is performed). For example, the system could be fully or partially executed on Hadoop, a cloud-based implementation, or a stand-alone implementation on a single computer. More specifically, for example, system development could be executed on a laptop, and production could be on Hadoop, where Hadoop could be hosted in a data center.

FIGS. 2-3 are diagrams comparing traditional signal architecture 40 and new data signal architecture 48 provided by the system. As shown, in the traditional signal architecture 40 (e.g., the spaghetti architecture), for every new use case 46, raw data 42 is transformed through processing steps 44, even if that raw data 42 had been previously transformed for a different use case 46. More specifically, a data element 42 must be processed for use in a first use case 46, and that same data element must be processed again for use in a second use case 46. In particular, the analytic code written to perform the processing steps 44 cannot be easily re-used. Comparatively, in the new data signal architecture 48 (e.g., the bowtie architecture) of the present disclosure, raw data 50 is transformed into descriptive and predictive signals 52 only once. Advantageously, the analytic code generated by the system for each signal 52 can be rapidly developed, deployed, and re-used with many of the use cases 54.
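
The difference between the two architectures can be illustrated, in a deliberately simplified way, by a store that computes each signal from raw data once and then serves the stored value to any number of use cases. The SignalStore class and its method names are hypothetical and are not part of the disclosed system.

```python
class SignalStore:
    """Hypothetical store that computes each signal once and reuses it (illustrative only)."""

    def __init__(self, raw_data):
        self.raw_data = raw_data
        self._definitions = {}   # signal name -> function of the raw data
        self._values = {}        # signal name -> stored result
        self.compute_count = {}  # signal name -> number of times computed

    def define(self, name, fn):
        """Register the analytic code that produces a signal."""
        self._definitions[name] = fn
        self.compute_count[name] = 0

    def get(self, name):
        """Return the signal, transforming raw data only on the first request."""
        if name not in self._values:
            self._values[name] = self._definitions[name](self.raw_data)
            self.compute_count[name] += 1
        return self._values[name]


# Two hypothetical use cases that consume the same signal.
def churn_use_case(store):
    return store.get("avg_spend") < 50


def upgrade_offer_use_case(store):
    return store.get("avg_spend") > 100
```

In the spaghetti arrangement each use case would recompute the average from raw data; here the second use case reads the value already produced for the first.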

Signals are key ingredients to solving an array of problems, including classification, regression, clustering (segmentation), forecasting, natural language processing, intelligent data design, simulation, incomplete data, anomaly detection, collaborative filtering, optimization, etc. Signals can be descriptive, predictive, or a combination thereof. For instance, Signal Hub can identify high-yield customers who have a high propensity to buy a discounted ticket to destinations that are increasing in popularity. Descriptive signals are those which use data to evaluate past behavior. Predictive signals are those which use data to predict future behavior. Signals become more powerful when the same data is examined over a (larger) period of time, rather than just an instance.

Descriptive signals could include purchase history, usage patterns, service disruptions, browsing history, time-series analysis, etc. As an example, an airline trying to improve customer satisfaction may want to know about the flying experiences of its customers, and it may be important to find out if a specific customer had his/her last flight cancelled. This is a descriptive signal that relies on flight information as it relates to customers. In this example, a new signal can be created to look at the total number of flight cancelations a given customer experienced over the previous twelve months. Signals can measure levels of satisfaction by taking into account how many times a customer was, for instance, delayed or upgraded in the last twelve months.
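
For illustration, the cancelation-count signal from the airline example could be computed along the following lines. The record layout (customer_id, date, cancelled), the function name, and the approximation of twelve months as 365 days are assumptions made for this sketch.

```python
from datetime import date, timedelta


def cancellations_last_12_months(flights, customer_id, as_of):
    """Count a customer's cancelled flights in the 365 days ending on `as_of`.

    `flights` is an iterable of dicts with hypothetical keys
    "customer_id", "date" (a datetime.date), and "cancelled" (a bool).
    """
    window_start = as_of - timedelta(days=365)  # twelve months approximated as 365 days
    return sum(
        1
        for flight in flights
        if flight["customer_id"] == customer_id
        and flight["cancelled"]
        and window_start < flight["date"] <= as_of
    )


flights = [
    {"customer_id": "C1", "date": date(2016, 1, 10), "cancelled": True},   # outside window
    {"customer_id": "C1", "date": date(2016, 9, 5), "cancelled": True},
    {"customer_id": "C1", "date": date(2017, 3, 2), "cancelled": False},
    {"customer_id": "C1", "date": date(2017, 5, 30), "cancelled": True},
    {"customer_id": "C2", "date": date(2017, 4, 1), "cancelled": True},
]
```

With these sample records and an as-of date of Jun. 21, 2017, the signal evaluates to 2 for customer C1 and 1 for customer C2.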

Descriptive signals can also look across different data domains to find information that can be used to create attractive business deals and/or to link events over time. For example, a signal may identify a partner hotel a customer tends to stay with so that a combined discounted deal (e.g., including the airline and the same hotel brand) can be offered to encourage the customer to continue flying with the same airline. This also allows for airlines to benefit from and leverage the customer's satisfaction level with the specific hotel partner. In this way, raw input data is consolidated across industries to create a specific relationship with a particular customer. Further, a flight cancelation followed by a hotel stay could indicate that the customer got to the destination but with a different airline or a different mode of transportation.

Predictive signals allow for an enterprise to determine what a customer will do next or how a customer will respond to a given event and then plan appropriately. Predictive signals could include customer fading, cross-sell/up-sell, propensity to buy, price sensitivity, offer personalization, etc. A predictive signal is usually created with a use case in mind. For example, a predictive signal could cluster customers that tend to fly on red-eye flights, or compute the propensity level a customer has for buying a business class upgrade.

Signals can be categorized into classes including sentiment signals, behavior signals, event/anomaly signals, membership/cluster signals, and correlation signals. Sentiment signals capture the collective prevailing attitude about an entity (e.g., consumer, company, market, country, etc.) given a context. Typically, sentiment signals have discrete states, such as positive, neutral, or negative (e.g., current sentiment on X corporate bonds is positive). Behavior signals capture an underlying fundamental behavioral pattern for a given entity or a given dataset (e.g., aggregate money flow into ETFs, number of “30 days past due” in last year for a credit card account, propensity to buy a given product, etc.). These signals are most often a time series and depend on the type of behavior being tracked and assessed. Event/Anomaly signals are discrete in nature and are used to trigger certain actions or alerts when a certain threshold condition is met (e.g., ATM withdrawal that exceeds three times the daily average, bond rating downgrade by a rating agency, etc.). Membership/Cluster signals designate where an entity belongs, given a dimension. For example, gaming establishments create clusters of their customers based on spending (e.g., high rollers, casual gamers, etc.), or wealth management firms can create clusters of their customers based on monthly portfolio turnover (e.g., frequent traders, buy and hold, etc.). Correlation signals continuously measure the correlation of various entities and their attributes throughout a time series of values between 0 and 1 (e.g., correlation of stock prices within a sector, unemployment and retail sales, interest rates and GDP, home prices and interest rates, etc.).
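
As one hedged example of an event/anomaly signal, the ATM threshold condition mentioned above could be expressed as a simple comparison against a multiple of the historical daily average. The function name, the representation of history as a list of daily withdrawal totals, and the handling of an empty history are assumptions for illustration.

```python
from statistics import mean


def withdrawal_anomaly(daily_history, amount, multiple=3.0):
    """Return True when `amount` exceeds `multiple` times the average of `daily_history`.

    `daily_history` is a hypothetical list of past daily withdrawal totals.
    With no history there is no average to compare against, so no alert is raised.
    """
    if not daily_history:
        return False
    return amount > multiple * mean(daily_history)
```

For a history of 100, 50, and 150 (a daily average of 100), a withdrawal of 301 would trigger the signal while a withdrawal of exactly 300 would not, because the condition is a strict "exceeds".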

Signals have attributes based on their representation in time or frequency domains. In a time domain, a signal can be continuous (e.g., output from a blood pressure monitor) or discrete (e.g., daily market close values of the Dow Jones Index). Within the frequency domain, signals can be defined as high or low frequency (e.g., asset allocation trends of a brokerage account can be measured every 15 minutes, daily, and monthly). Depending on the frequency of measurement, a signal derived from the underlying data can be fast-moving or slow-moving.

Signals are organized into signal sets that describe (e.g., relate to) specific business domains (e.g., customer management). Signal sets are industry-specific and cover domains including customer management, operations, fraud and risk management, maintenance, network optimization, digital marketing, etc. Signal sets could be dynamic (e.g., continually updated as source data is refreshed), flexible (e.g., adaptable for expanding parameters and targets), and scalable (e.g., repeatable across multiple use cases and applications).

FIGS. 4A-4B are diagrams illustrating the system in greater detail. The main components of Signal Hub 60 include an integrated development environment (Workbench) 62, Knowledge Center (KC) 64, Signal Hub Manager (“SHM”) 65, and Signal Hub Server 66. The Workbench 62 is an integrated software-based productivity tool for data scientists and developers, offering analytic functionalities and approaches for the making of a complete analytic solution, from data to intelligence to value. The Workbench 62 enables scientists to more effectively transform data to intelligence through the creation of signals. Additionally, the Workbench 62 allows data scientists to rapidly develop and deploy reusable analytic code for conducting analytics on various (often, disparate) data sources, on numerous computer platforms. The Knowledge Center 64 is a centralized place for institutional intelligence and memory and facilitates the transformation of intelligence to value through the exploration and consumption of signals. The Knowledge Center 64 enables the management and reuse of signals, which leads to scalability and increased productivity. The Signal Hub Manager 65 provides a management and monitoring console for analytic operational stewards (e.g., IT, business, science, etc.). The Signal Hub Manager 65 facilitates understanding and managing the production quality and computing resources with an alert system. Additionally, the Signal Hub Manager 65 provides role-based access control for all Signal Hub platform components to increase network security in an efficient and reliable way. The Signal Hub Server 66 executes analytics by running the analytic code developed in the Workbench 62 and producing the signal output. The Signal Hub Server 66 provides fast, flexible and scalable processing of data, code, and artifacts (e.g., in Hadoop via a data-flow execution engine; Spark integration). The Signal Hub Server 66 is responsible for the end-to-end processing of data and its refinement into signals, as well as enabling users to solve problems across industries and domains (e.g., making Signal Hub a horizontal platform).

The platform architecture provides great deployment flexibility. It can be implemented on a single server as a single process (e.g., a laptop), or it can run on a large-scale Hadoop cluster with distributed processing, without modifying any code. It could also be implemented on a standalone computer. This allows scientists to develop code on their laptops and then move it into a Hadoop cluster to process large volumes of data. The Signal Hub Server architecture addresses the industry need for large-scale production-ready analytics, a need that popular tools such as SAS and R cannot fulfill even today, as their basic architecture is fundamentally limited by main memory.

Signal Hub components include signal sets, ETL processing, dataflow engine, signal-generating components (e.g., signal-generation processes), APIs, centralized security, model execution, and model monitoring. The more use cases that are executed using Signal Hub 60, the less time it takes to actually implement them over time because the answers to a problem may already exist inside Signal Hub 60 after a few rounds of signal creation and use case implementation. Signals are hierarchical, such that within Signal Hub 60, a signal array might include simple signals that can be used by themselves to predict behavior (e.g., customer behavior powering a recommendation) and/or can be used as inputs into more sophisticated predictive models. These models, in turn, could generate second-order, highly refined signals, which could serve as inputs to business-process decision points.

The design of the system and Signal Hub 60 allows users to use a single simple expression that represents multiple expressions of different levels of data aggregations. For example, suppose there is a dataset with various IDs. Each ID could be associated with an ID type which could also be associated with an occurrence of an event. One level of aggregation could be to determine for each ID and each ID type, the number of occurrences of an event. A second level of aggregation could be to determine for each ID, what is the most common type of ID based on the number of occurrences of an event. The system of the present disclosure allows this determination based on multiple layers of aggregation to be based on a single scalar expression and returning one expected output at one time. For example, using the code “category_histogram(col),” the system will create a categorical histogram for a given column, with each unique value in the column being considered a category. Using the code “mode(histogram, n=1)” allows the system to return the category with the highest number of entries. If n>1, the system retrieves the n-th most common value (2nd, 3rd, etc.); if n<0, it retrieves the least common value (n=−1), the second least common value (n=−2), etc. In the event several keys have equal frequencies, the smallest (if keys are numerical) or earliest (if keys are alphabetical) is returned. The following is an example of a sample input and output based on the foregoing example.

Input:

id    type
1     A
1     A
1     A
1     B
2     B
2     B
2     C

Output:

Id    Mode_1
1     A
2     B
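The two-level aggregation described above can be sketched in Python as follows. This is only an illustration, not the system's implementation: the function names mirror the expressions quoted in the text, but their bodies, the `rows` sample, and the tie-breaking rule applied for negative n are assumptions.

```python
from collections import Counter, defaultdict


def category_histogram(values):
    """Count how often each unique value (category) occurs."""
    return Counter(values)


def mode(histogram, n=1):
    """Return the n-th most common category.

    n >= 1 counts from the most common; n <= -1 counts from the least common.
    Ties are broken by the smallest/earliest key (assumed for negative n too).
    """
    if n == 0 or abs(n) > len(histogram):
        raise ValueError("n is out of range for this histogram")
    if n > 0:
        ranked = sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[n - 1][0]
    ranked = sorted(histogram.items(), key=lambda kv: (kv[1], kv[0]))
    return ranked[-n - 1][0]


# Sample input from the text: (id, type) pairs.
rows = [(1, "A"), (1, "A"), (1, "A"), (1, "B"), (2, "B"), (2, "B"), (2, "C")]

# First level: group types by id; second level: mode of each histogram.
types_by_id = defaultdict(list)
for id_, type_ in rows:
    types_by_id[id_].append(type_)

mode_by_id = {id_: mode(category_histogram(types), n=1)
              for id_, types in types_by_id.items()}
print(mode_by_id)
```

With the sample input, id 1 maps to A and id 2 maps to B, matching the output table.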

FIG. 4C is a screenshot of an event pattern matching feature of the system of the present disclosure. The system allows users to determine whether a specified sequence of events occurred in the data and then submit a query to retrieve information about the matched data. For example, in FIG. 4C, for the raw input data shown, a user can (1) define an event; (2) create a pattern matcher; and (3) query the pattern matcher to return the output as shown. As can be seen, a user can easily define with a regular expression an occurrence of a specified event such as “service fixed after call.” Once the pattern-matching algorithm is executed, a signal is extracted in the output showing the pattern occurrence.
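One way to picture regular-expression matching over event sequences is to encode each event as a single character and run an ordinary regular expression over the encoded string. The sketch below uses Python's standard `re` module; the event names, the one-letter codes, and the pattern are hypothetical and are not taken from FIG. 4C.

```python
import re

# Hypothetical event vocabulary: each event type gets a one-letter code.
EVENT_CODES = {"outage": "O", "call": "C", "fixed": "F"}

# "Service fixed after call": one or more calls followed by a fix.
SERVICE_FIXED_AFTER_CALL = re.compile(r"C+F")


def encode(events):
    """Turn a time-ordered list of event names into a string of codes."""
    return "".join(EVENT_CODES[event] for event in events)


def count_matches(events, pattern=SERVICE_FIXED_AFTER_CALL):
    """Number of non-overlapping occurrences of the pattern in the events."""
    return len(pattern.findall(encode(events)))


history = ["outage", "call", "call", "fixed", "call", "fixed"]
print(count_matches(history))
```

The count per customer could then serve as a descriptive signal of the kind discussed in the text.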

FIG. 5 is a screenshot illustrating a Workbench 70 generated by the system. The Workbench 70 (along with the Knowledge Center) enables users to interact with the functionality and capabilities of the Signal Hub system via a graphical user interface (GUI). The Workbench 70 is an environment to develop end-to-end analytic solutions (e.g., a development environment for analytics) including reusable and easily developed analytic code. It offers all the necessary functionality for covering the entire analytic modeling process, from data to signals. It provides an environment for the coding and development of data schemas, data quality management processes (e.g., missing value imputation and outlier detection), collections (e.g., the gathering of raw data files with the same data schema), views (e.g., logic to create a new relational dataset from other views or collections), descriptive and predictive signals, model validation and visualization (e.g., measuring of model performance through ROC (receiver operating characteristic), KS (Kolmogorov-Smirnov), Lorenz curves, etc.), visualization and maintenance of staging, input, output data models, etc. The Workbench 70 facilitates data ingestion and manipulation, as well as enabling data scientists to extract intelligence and value from data through signals (e.g., analytics through signal creation and computation).

The user interface of the Workbench could include components such as a tree view 72, an analytic code development window 74, and a supplementary display portion 76. The tree view 72 displays each collection of raw data files (e.g., indicated by “Col” 73a) as well as logical data views (e.g., indicated by “Vw” 73b), as well as third-party code called as user defined functions, if any (e.g., Python, R, etc.). The analytic code development window 74 has a plurality of tabs including Design 78, Run 80, and Results 82. The Design tab 78 provides a space where analytic code can be written by the developer. The Run tab 80 allows the developer to run the code and generate signal sets. Finally, the Results tab 82 allows the developer to view the data produced by the operations defined in the Run tab 80.

The supplementary display portion 76 could include additional information including schemas 84 and dependencies 86. Identifying, extracting, and calculating signals at scale from noisy big data requires a set of predefined signal schema and a variety of algorithms. A signal schema is a specific type of template used to transform data into signals. Different types of schema may be used, depending on the nature of the data, the domain, and/or the business environment. Initial signal discovery could fall into one or more of a variety of problem classes (e.g., regression, classification, clustering, forecasting, optimization, simulation, sparse data inference, anomaly detection, natural language processing, intelligent data design, etc.). Solving these problem classes could require one or more of a variety of modeling techniques and/or algorithms (e.g., ARMA, CART, CIR++, compression nets, decision trees, discrete time survival analysis, D-Optimality, ensemble model, Gaussian mixture model, genetic algorithm, gradient boosted trees, hierarchical clustering, Kalman filter, k-means, KNN, linear regression, logistic regression, Monte Carlo simulation, multinomial logistic regression, neural networks, optimization (LP, IP, NLP), Poisson mixture model, Restricted Boltzmann Machine, sensitivity trees, SVD, A-SVD, SVD++, SVM, projection on latent structures, spectral graph theory, etc.).

Advantageously, the Workbench 70 provides access to pre-defined libraries of such algorithms, so that they can be easily accessed and included in analytic code being generated. The user then can re-use analytic code in connection with various data analytics projects. Both data models and schemas can be developed within the Workbench 70 or imported from popular third-party data modeling tools (e.g., CA Erwin). The data models and schemas are stored along with the code and can be governed and maintained using modern software lifecycle tools. Typically, at the beginning of a Signal Hub project, the Workbench 70 is used by data scientists for profiling and schema discovery of unfamiliar data sources. Signal Hub provides tools that can discover schema (e.g., data types and column names) from a flat file or a database table. It also has built-in profiling tools, which automatically compute various statistics on each column of the data such as missing values, distribution parameters, frequent items, and more. These built-in tools accelerate the initial data load and quality checks.

Once data is loaded and discovered, it needs to be transformed from its raw form into a standard representation that will be used to feed the signals in the signal layer. Using the Workbench 70, data scientists can build workflows composed of “views” that transform the data and apply data quality checks and statistical measures. The Signal Hub platform can continuously execute these views as new data appears, thus keeping the signals up to date.

The dependencies tab 86 could display a dependency diagram (e.g., a graph) of all the activities comprising the analytic project, as discussed below in more detail. A bottom bar 88 could include compiler information, such as the number of errors and warnings encountered while processing views and signal sets.

FIG. 6 is a diagram 90 illustrating use cases (e.g., outputs, signals, etc.) of the system. There could be multiple signal libraries, each with subcategories for better navigation and signal searching. For example, as shown, the Signal Hub could include a Customer Management signal library 92. Within the Customer Management signal library 92 are subcategories for Flight 94, Frequent Flyer Program 96, Partner 98, and Ancillary 99. The Flight subcategory 94 could include, for example, “Signal 345. Number of times customer was seated in middle seat in the past 6 months,” “Signal 785. Number of trips customer has made on a weekend day in past 1 year,” “Signal 956. Number of flights customer with <45 mins between connections,” “Signal 1099. Indicates a customer has been delayed more than 45 minutes in last 3 trips,” “Signal 1286. Number of involuntary cancellations experienced by the customer in past 1 year,” etc. The Frequent Flyer Program subcategory 96 could include, for example, “Signal 1478. % of CSat surveys taken out of total flights customer has flown in past 1 month,” “Signal 1678. Number of complimentary upgrades a member received in past 6 months,” “Signal 2006. Ratio of mileage earned to mileage used by a member in past 1 year,” “Signal 2014. Average # of days before departure when an upgrade request is made by member,” “Signal 2020. Number upgrades redeemed using mileage in past 1 year,” etc. The Partner subcategory 98 could include, for example, “Signal 563. Mileage earned using Cable Company (TM) in past 1 month,” “Signal 734. Number of partners with whom that customer has engaged in the past 6 months,” “Signal 737. Mileage earned via Rental Car in past 1 yr,” “Signal 1729. Number of emails received about Luxury Hotel in the past 3 months,” “Signal 1993. Number of times customer booked hotel with Airlines' partner without booking associated flight in the past 1 year,” etc. The Ancillary subcategory 99 could include, for example, “Signal 328. Number of times customer has had baggage misplaced in past 3 months,” “Signal 1875. Total amount spent on check bags in past 1 month,” “Signal 1675. Number of times wifi was unavailable on customer's flight,” “Signal 1274. Number of emails received pertaining to bags in last 1 year,” “Signal 1564. Number of times customer has purchased duty free on board,” etc.

FIG. 46 is a schematic diagram illustrating an example signal creation layer methodology in accordance with one exemplary embodiment. In this example, signals can be created through various combinations and permutations of categories including entity, transformation, attribute, and time frame. Thus, one exemplary descriptive signal is a household's count of trips to the store in the past 1 year, and one exemplary predictive signal is a card holder's projected revenue in a particular product category in the future 6 months. Signals can be based on scalar functions (e.g., functions that compute a value based on a single record) or aggregate functions (e.g., functions that need to pass over the entire dataset in order to compute a value, such as to compute, say, average household income across all households). Such signals can be created automatically and can be updated automatically as source data is received and processed. The signals can be provided as a signal layer for use by various applications.
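The combination of entity, transformation, attribute, and time frame can be illustrated with a small Python sketch. The `SignalSpec` class, the `compute_signal` function, and the record layout are hypothetical names introduced here for illustration; they are not part of the Signal API described in this disclosure.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable


@dataclass(frozen=True)
class SignalSpec:
    entity: str                  # column identifying the entity, e.g. household
    transformation: Callable     # aggregate applied to the collected values
    attribute: str               # column the transformation is applied to
    window_days: int             # time frame, counted back from the as-of date


def compute_signal(spec, records, as_of):
    """Aggregate the attribute per entity over the look-back window."""
    start = as_of - timedelta(days=spec.window_days)
    grouped = {}
    for rec in records:
        if start < rec["date"] <= as_of:
            grouped.setdefault(rec[spec.entity], []).append(rec[spec.attribute])
    return {entity: spec.transformation(vals) for entity, vals in grouped.items()}


# "Household's count of trips to the store in the past 1 year".
trips_past_year = SignalSpec("household", len, "trip_id", 365)

records = [
    {"household": "H1", "trip_id": 1, "date": date(2022, 1, 1)},   # too old
    {"household": "H1", "trip_id": 2, "date": date(2024, 1, 10)},
    {"household": "H1", "trip_id": 3, "date": date(2024, 6, 1)},
    {"household": "H2", "trip_id": 4, "date": date(2024, 3, 1)},
]
signal = compute_signal(trips_past_year, records, as_of=date(2024, 12, 31))
print(signal)
```

Swapping `len` for `sum` or a mean, or changing `window_days`, yields a different signal from the same specification shape, which is the combinatorial idea the figure describes.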

FIG. 7 is a diagram illustrating analytic model development and deployment carried out by the system. In step 202, a user defines a business requirement (e.g., business opportunity, business problem) needing analysis. In step 204, one or more analytics requirements are defined. In step 214, the user searches for signals, and if an appropriate signal is found, the user selects the signal. If a signal is not found, then in step 212, the user creates one or more signals by identifying the aggregated and cleansed data to base the signal on. After the signal is created, the process then proceeds to step 214. If the raw data is not available to create the signal in step 212, then in step 208 the user obtains the raw data, and in step 210, the data is aggregated and cleansed, and then the process proceeds to step 212. It is noted that the system of the present disclosure facilitates skipping steps 208-212 (unlike the traditional approach which must proceed through such steps for every new business requirement).

Once the signals are selected, then in step 216, solutions and models are developed based on the signals selected. In step 218, results are evaluated and, if necessary, signals (e.g., created and/or selected) and/or solutions/models are revised accordingly. Then in step 220, the solutions/models are deployed. In step 222, results are monitored and feedback gathered to incorporate back into the signals and/or solutions/models.

FIG. 8 is a diagram 250 illustrating hardware and software components of the system in one implementation. Other implementations are possible. The workflow includes model-building tools 252, Hadoop/YARN and Signal Hub processing steps 254, and Hadoop Data Lake (Hadoop Distributed File System (HDFS) and HIVE) databases 256.

The Signal Hub Server is able to perform large-scale processing of terabytes of data across thousands of signals. It follows a data-flow architecture for processing on a Hadoop cluster (e.g., Hadoop 2.0). Hadoop 2.0 introduced YARN (a large-scale, distributed operating system for big data applications), which allows many different data processing frameworks to coexist and establishes a strong ecosystem for innovating technologies. With YARN, Signal Hub Server solutions are native certified Hadoop applications that can be managed and administered alongside other applications. Signal Hub users can leverage their investment in Hadoop technologies and IT skills and run Signal Hub side-by-side with their current Hadoop applications.

Raw data is stored in the raw data database 258 of the Hadoop Data Lake 256. In step 260, Hadoop/YARN and Signal Hub 254 process the raw data 258 with ETL (extract, transform, and load) modules, data quality management modules, and standardization modules. The results of step 260 are then stored in a staging database 262 of the Hadoop Data Lake. In step 264, Hadoop/YARN and Signal Hub 254 process the staging data 262 with signal calculation modules, data distribution modules, and sampling modules. The results of step 264 are then stored in the Signals and Model Input database 266. In step 268, the model development and validation module 268 of the model building tools 252 processes the signals and model input data 266. The results of step 268 are then stored in the model information and parameters database 270. In step 272, the model execution module 272 of the Hadoop/YARN and Signal Hub 254 processes signals and model input data 266 and/or model information and parameters data 270. The results of step 272 are then stored in the model output database 274. In step 276, the Hadoop/YARN and Signal Hub 254 processes the model output data 274 with a business rules execution output transformation for business intelligence and case management user interface. The results of step 276 are then stored in the final output database 278. Enterprise applications 280 and business intelligence systems 282 access the final output data 278, and can provide feedback to the system which could be integrated into the raw data 258, the staging data 262, and/or the signals and model input 266.

The Signal Hub Server automates the processing of inputs to outputs. Because of its data flow architecture, it has a speed advantage. The Signal Hub Server has multiple capabilities to automate server management. It can detect data changes within raw file collections and then trigger a chain of processing jobs to update existing signals with the relevant data changes without transactional system support.
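A change-triggered chain of processing jobs can be pictured as a walk over a dependency graph. The following Python sketch assumes a hypothetical graph of collections, views, and signals and compares file fingerprints to detect changes; it is not the Signal Hub Server's actual mechanism. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each node maps to the nodes it reads from.
DEPENDS_ON = {
    "vw_clean_calls": {"col_calls"},
    "vw_customer": {"col_accounts"},
    "sig_complaints": {"vw_clean_calls", "vw_customer"},
    "sig_tenure": {"vw_customer"},
}


def detect_changes(previous, current):
    """Collections whose fingerprint (e.g., checksum) differs or is new."""
    return {name for name, fp in current.items() if previous.get(name) != fp}


def jobs_to_rerun(changed, depends_on):
    """Downstream nodes of the changed collections, in a valid run order."""
    stale = set(changed)
    grew = True
    while grew:  # propagate staleness until no new node is affected
        grew = False
        for node, parents in depends_on.items():
            if node not in stale and parents & stale:
                stale.add(node)
                grew = True
    order = TopologicalSorter(depends_on).static_order()
    return [node for node in order if node in stale and node not in changed]


changed = detect_changes({"col_calls": "a1", "col_accounts": "b1"},
                         {"col_calls": "a2", "col_accounts": "b1"})
print(jobs_to_rerun(changed, DEPENDS_ON))
```

Only the view and signal that depend on the changed collection are scheduled; nodes fed solely by unchanged collections are left alone.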

FIGS. 9-10 are diagrams illustrating hardware and software components of the system during development and production. More specifically, FIG. 9 is a diagram 300 illustrating hardware and software components of the system during development and production. Source data 302 is in electrical communication with Signal Hub 304. Signal Hub 304 comprises a Workbench 306 and a Knowledge Center 308. Signal Hub 304 could also include a server in electronic communication with the Workbench 306 and the Knowledge Center 308, such as via Signal Hub Manager 312. Signal Hub further comprises infrastructure 314 (e.g., Hadoop, YARN, etc.) and hosting options 316, such as Client, Opera, and Virtual Cloud (e.g., AWS).

Signal Hub 304 allows companies to absorb information from various data sources 302 to be able to address many types of problems. More specifically, Signal Hub 304 can ingest both internal and external data as well as structured and unstructured data. As part of the Hadoop ecosystem, the Signal Hub Server can be used together with tools such as Sqoop or Flume to digest data after it arrives in the Hadoop system. Alternatively, the Signal Hub Server can directly access any JDBC (Java Database Connectivity) compliant database or import various data formats transferred (via FTP, SFTP, etc.) from source systems.

Signal Hub 304 can incorporate existing code 318 coded in various (often non-compatible) languages (e.g., Python, R, Unix Shell, etc.), called from the Signal Hub platform as user defined functions. Signal Hub 304 can further communicate with modeling tools 320 (e.g., SAS, SPSS, etc.), such as via flat file, PMML (Predictive Model Markup Language), etc. The PMML format is a file format describing a trained model. A model developed in SAS, R, SPSS, or other tools can be consumed and run within Signal Hub 304 via the PMML standard. Advantageously, such a solution allows existing analytic code that may be written in various, non-compatible languages (e.g., SAS, SPSS, Python, R, etc.) to be seamlessly converted and integrated for use together within the system, without requiring that the existing code be re-written. Additionally, Signal Hub 304 can create tests and reports as needed. Through the Workbench, descriptive signals can be exported into a flat file for the training of predictive models outside Signal Hub 304. When the model is ready, it can then be brought back to Signal Hub 304 via the PMML standard. This feature is very useful if a specific machine-learning technique is not yet part of the model repertoire available in Signal Hub 304. It also allows Signal Hub 304 to ingest models created by clients in third-party analytic tools (including R, SAS, SPSS). The use of PMML allows Signal Hub users to benefit from a high level of interoperability among systems where models built in any PMML-compliant analytics environment can be easily consumed. In other words, because the system can automatically convert existing (legacy) analytic code modules/libraries into a common format that can be executed by the system (e.g., by automatically converting such libraries into PMML-compliant libraries that are compatible with other similarly compliant libraries), the system thus permits easy integration and re-use of legacy analytic code, interoperably with other modules throughout the system.

Signal Hub 304 integrates seamlessly with a variety of front-end systems 322 (e.g., use-case specific apps, business intelligence, customer relationship management (CRM) system, content management system, campaign execution engine, etc.). More specifically, Signal Hub 304 can communicate with front-end systems 322 via a staging database (e.g., MySQL, HIVE, Pig, etc.). Signals are easily fed into visualization tools (e.g., Pentaho, Tableau), CRM systems, and campaign execution engines (e.g., Hubspot, ExactTarget). Data is transferred in batches, written to a special data landing zone, or accessed on-demand via APIs (application programming interfaces). Signal Hub 304 could also integrate with existing analytic tools, pre-existing code, and models. Client code can be loaded as an external library and executed within the server. All of this ensures that existing client investments in analytics can be reused with no need for recoding.

The Workbench 306 could include a workflow to process signals that includes loading 330, data ingestion and preparation 332, descriptive signal generation 334, predictive signal generation 336, use case building 338, and sending 340. In the loading step 330, source data is loaded into the Workbench 306 in any of a variety of formats (e.g., SFTP, JDBC, Sqoop, Flume, etc.). In the data ingestion and preparation step 332, the Workbench 306 provides the ability to process a variety of big data (e.g., internal, external, structured, unstructured, etc.) in a variety of ways (e.g., delta processing, profiling, visualizations, ETL, DQM, workflow management, etc.). In the descriptive signal generation step 334, a variety of descriptive signals could be generated (e.g., mathematical transformations, time series, distributions, pattern detection, etc.). In the predictive signal generation step 336, a variety of predictive signals could be generated (e.g., linear regression, logistic regression, decision tree, Naïve Bayes, PCA, SVM, deep autoencoder, etc.). In the use case building step 338, use cases could be created (e.g., reporting, rules engine, workflow creator, visualizations, etc.). In the sending step 340, the Workbench 306 electronically transmits the output to downstream connectors (e.g., APIs, SQL, batch file transfer, etc.).

FIG. 10 is a diagram 350 illustrating hardware and software components of the system during production. As discussed with reference to FIG. 9, Signal Hub includes a Workbench 352, a Knowledge Center 354, and a Signal Hub Manager 356. The Workbench 352 could communicate with an execution layer 360 via a compiler 358. The Knowledge Center 354 and Signal Hub Manager 356 could directly communicate with the execution layer 360. The execution layer 360 could include a workflow server 362, a plurality of flexible data flow engines 364, and an operational graph database 366. Signal Hub further comprises infrastructure 366 (e.g., Hadoop, YARN, etc.) and hosting options 370, such as Client, Opera, and Virtual Private Cloud (e.g., AWS, Amazon, etc.). The plurality of flexible data flow engines 364 can have the latest cutting-edge technology.

FIGS. 11-17 are screenshots illustrating use of the Signal Hub platform to create descriptive signals. The Workbench user interface 500 includes a tree view 502 and an analytic code development window 504. The Workbench provides direct access to the Signal API, which speeds up development and simplifies (e.g., reduces errors in) signal creation (e.g., descriptive signals). The Signal API provides an ever-growing set of mathematical transformations that will allow for the creation of powerful descriptive signals, along with a syntax that is clear, concise, and expressive. The Signal API allows scientists to veer away from the implementation details and focus solely on data analysis, thus maximizing productivity and code reuse. For example, the Signal API allows for easy implementation of complex pattern-matching signals. For example, for the telecom industry, one pattern could be a sequence of events in the data that are relevant for measuring attrition, such as a widespread service disruption followed by one or more customer complaints followed by restored service. The Signal API also provides a direct link between the Workbench and the Knowledge Center. Users can add metatags and descriptions to signals directly in Signal API code (which is reusable analytic code). These tags and taxonomy information are then used by the Knowledge Center to enable signal search and reuse, which greatly enhances productivity.

As for predictive signals, training and testing of models can easily be done in the Workbench through its intuitive and interactive user interface. Current techniques available for modeling and dimensionality reduction include SVMs, k-means, decision trees, association rules, linear and logistic regression, neural networks, RBM (machine-learning technique), PCA, and Deep AutoEncoder (machine-learning technique) which allows data scientists to train and score deep-learning nets. Some of these advanced machine-learning techniques (e.g., Deep AutoEncoder and RBM) project data from a high-dimensional space into a lower-dimensional one. These techniques are then used together with clustering algorithms to understand customer behavior.

FIG. 11 is a screenshot illustrating data profiles for each column (e.g., number of unique, number of missing, average, max, min, etc.) using the Workbench 500 generated by the system. As described above, the Workbench user interface could include sets of components including a tree view 502, an analytic code development window 504, and a supplementary display portion 506. The analytic code development window 504 includes a design tab 508, which provides a user with the ability to choose a format, name, file pattern, schema, header, and/or field separator. Signal Hub supports various input file formats including delimited, fixed width, JDBC, XML, Excel, log file, etc. A user can load data from various data sources. More specifically, parameterized definitions allow a user to load data from a laptop, cluster, and/or client database system. The supplementary display portion 506 includes a YAML tab 510, a Schema tab 512, and a dependencies tab 514. The YAML tab 510 includes a synchronized editor so that a user can develop the code in a graphical way or in a plain text format, where these two formats are easily synchronized.

FIG. 12 is a screenshot illustrating profiling of raw data using the Workbench 500 generated by the system. The analytic code development window 504 includes a design tab 508, a run tab 520, and a results tab 522. The design tab 508 is activated, and within the design tab 508 are a plurality of other tabs. More specifically, the design tab 508 includes a transformations tab 524, a measures tab 526, a models tab 528, a persistence tab 530, a meta tab 532, and a graphs tab 534. The measures tab 526 is activated, thereby allowing a user to add a measure from a profiling library, such as from a drop down menu. The profiling library offers data profiling tools to help a user understand the data. For example, profiling measures could include basicStats, contingencyTable, edd (Enhanced Data Dictionary), group, histogram, monotonic, percentiles, woe, etc. The edd is a data profiling capability which analyzes content of data sources.

FIG. 13 is a screenshot illustrating displaying of specific entries within raw data using the Workbench 500 generated by the system. The analytic code development window 504 includes a table 540 showing specific data entries for the measure “edd”, as well as a plurality of columns pertaining to various types of information for each data entry. More specifically, the table 540 includes columns directed to obs, name, type, nmiss, pctMissing, unique, stdDev, mean_or_top1, min_or_top2, etc. The table 540 includes detailed data statistics including number of records, missing rate, unique values, percentile distribution, etc.
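A per-column profile of this kind can be approximated in a few lines of Python. The sketch below borrows several column names from the table described above (obs, name, type, nmiss, pctMissing, unique, mean_or_top1), but the function itself and the reading of mean_or_top1 as "mean for numeric columns, most frequent value otherwise" are assumptions, not the edd implementation.

```python
from collections import Counter
from statistics import fmean


def profile_column(name, values):
    """Compute basic profile statistics for one column; None marks missing."""
    present = [v for v in values if v is not None]
    nmiss = len(values) - len(present)
    stats = {
        "name": name,
        "obs": len(values),
        "nmiss": nmiss,
        "pctMissing": 100.0 * nmiss / len(values) if values else 0.0,
        "unique": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        stats["type"] = "numeric"
        stats["mean_or_top1"] = fmean(present)  # mean for numeric columns
    else:
        stats["type"] = "categorical"
        top = Counter(present).most_common(1)
        stats["mean_or_top1"] = top[0][0] if top else None  # most frequent value
    return stats


age_profile = profile_column("age", [30, None, 40, 30])
seat_profile = profile_column("seat", ["aisle", "middle", "aisle", None])
print(age_profile)
print(seat_profile)
```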

FIG. 14 is a screenshot illustrating aggregating and cleaning of raw data using the Workbench 500 generated by the system. As shown, the analytic code development window 504 has the transformations tab 524 activated. The transformations tab 524 is directed to the transformation library, which allows users to do various data aggregation and cleaning work before using data. In the transformations tab 524, the user can add one or more transformations, such as cubePercentile, dedup, derive, filter, group, join, limitRows, logRows, lookup, etc.

FIG. 15 is a screenshot illustrating managing and confirmation of raw data quality using the Workbench 500 generated by the system. As shown, the analytic code development window 504 has the transformations tab 524 activated. A user can gather more information about each transformation, such as shown for Data Quality. The data quality management uses a series of checks, each of which contains a predicate, an action, and an optional list of fields, to control and manage the data quality.
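The predicate/action/fields structure of a data quality check can be sketched as follows. The `Check` class, the two action names ("reject" and "nullify"), and the sample records are hypothetical; the disclosure does not specify which actions are available.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Check:
    predicate: Callable[[object], bool]      # True when a value is acceptable
    action: str                              # "reject" the record or "nullify" the field
    fields: Optional[Sequence[str]] = None   # None means apply to every field


def apply_checks(records, checks):
    """Run each check on each record; return the cleaned, surviving records."""
    kept = []
    for original in records:
        rec = dict(original)  # work on a copy so the input is not mutated
        rejected = False
        for check in checks:
            for field in (check.fields or list(rec)):
                if check.predicate(rec.get(field)):
                    continue
                if check.action == "reject":
                    rejected = True
                elif check.action == "nullify":
                    rec[field] = None
                else:
                    raise ValueError(f"unknown action: {check.action}")
        if not rejected:
            kept.append(rec)
    return kept


checks = [
    Check(lambda v: v is not None, "reject", ["id"]),              # id is required
    Check(lambda v: v is None or v >= 0, "nullify", ["amount"]),   # no negatives
]
raw = [
    {"id": 1, "amount": 10},
    {"id": None, "amount": 5},
    {"id": 2, "amount": -3},
]
clean = apply_checks(raw, checks)
print(clean)
```

In this sketch the record without an id is dropped, while the negative amount is blanked and the record kept.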

FIG. 16 is a screenshot illustrating auto-generated visualization of a data model created using the Workbench 500. This visualization could be automatically generated from YAML code (e.g., the code that reads and does initial linking and joining of data). As shown, the analytic code development window 504 allows a user to view relations and interactions between various data elements. The data model organizes data elements into fact and dimension tables and standardizes how the data elements relate to one another. This could be automatically generated in Signal Hub after loading the data.

FIG. 17A is a screenshot illustrating creation of reusable analytic code using the Workbench 500 generated by the system. As shown, the analytic code development window 504 includes many lines of code that incorporate and utilize the raw data previously selected and prepared. The Signal API could be scalable and easy to use (e.g., for loop signals, peer comparison signals, etc.). Further, Signal Hub could provide signal management by using @tag and @doc to specify signal metadata and description, which can be automatically extracted and displayed in the Knowledge Center.

FIG. 17B is a screenshot illustrating the graphical user interface of the Signal API in the Workbench. Similar to Excel, users can select from a function list 524 and a column list 526 to create new signals, with a description 528 and example code provided at the bottom. Users can use the Signal API either in a plain text format or in a graphical way, and the two formats are easily synchronized.

FIGS. 18-23 are screenshots illustrating user interface screens generated by the system using the Knowledge Center 600 to find and use a signal. As an integral part of Signal Hub, the Knowledge Center could be used as an interactive signal management system to enable model developers and business users to easily find, understand, and reuse signals that already exist in the signal library inside Signal Hub. The Knowledge Center allows for the intelligence (e.g., signals) to be accessed and explored across use cases and teams throughout the enterprise. Whenever a new use case needs to be implemented, the Knowledge Center enables relevant signals to be reused so that their intrinsic value naturally flows toward the making of a new analytic solution that drives business value.

Multiple features of the Knowledge Center facilitate accessing and consuming intelligence. The first is its filtering and searching capabilities. When signals are created, they are tagged based on metadata and organized around a taxonomy. The Knowledge Center empowers business users to explore the signals through multiple filtering and searching mechanisms.

Key components of the metadata in each signal include the business description, which explains what the signal is (e.g., number of times a customer sat in the middle seat on a long-haul flight in the past three years). Another key component of the metadata in each signal is the taxonomy, which shows each signal's classification based on its subject, object, relationship, time window, and business attributes (e.g., subject=customer, object=flight, relationship=count, time window=single period, and business attributes=long haul and middle seat).
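By way of non-limiting illustration, the signal metadata described above, and the filtering and free-text searching it supports, might be sketched in Python as follows. The class, the function, and the signal names in the catalog are hypothetical. Only the taxonomy dimensions and the two business descriptions that mention the middle seat and passengers per trip come from the present disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalMeta:
    """Metadata for one signal: a business description plus its taxonomy."""
    name: str
    description: str
    subject: str
    obj: str                # the taxonomy "object" dimension
    relationship: str
    time_window: str
    business_attributes: frozenset = frozenset()


def find_signals(catalog, text="", **taxonomy):
    """Return names of signals whose description contains `text`
    and whose taxonomy fields equal every keyword given."""
    hits = []
    for signal in catalog:
        if text.lower() not in signal.description.lower():
            continue
        if all(getattr(signal, key) == value for key, value in taxonomy.items()):
            hits.append(signal.name)
    return hits


# Hypothetical catalog entries for illustration only.
catalog = [
    SignalMeta(
        "cnt_middle_seat_long_haul_p3y",
        "Number of times a customer sat in the middle seat on a long-haul "
        "flight in the past three years",
        "customer", "flight", "count", "single period",
        frozenset({"long haul", "middle seat"}),
    ),
    SignalMeta(
        "avg_pax_per_trip",
        "Average number of passengers per trip customer travelled with",
        "customer", "trip", "average", "single period",
    ),
    SignalMeta(
        "cnt_hotel_bookings_p1y",
        "Number of hotel bookings in the past year",
        "customer", "hotel", "count", "single period",
    ),
]

count_signals = find_signals(catalog, relationship="count")
hotel_signals = find_signals(catalog, text="hotel")
middle_seat = find_signals(catalog, text="middle seat", subject="customer")
```

A metadata filter and a free-text search can be combined in a single call, as in the last line, which mirrors using the side-panel filters together with the search bar.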

The Knowledge Center facilitates exploring and identifying signals based on this metadata when executing use cases by using filtering and free-text searching. The Knowledge Center also allows for a complete visualization of all the elements involved in the analytical solution. Users can visualize how data sources connect to models through a variety of descriptive signals, which are grouped into Signal Sets depending on a pre-specified and domain-driven taxonomy. The same interface also allows users to drill into specific signals. Visualization tools can also allow a user to visualize end-to-end analytics solution components from the data, to the signals, and finally to the use cases. The system can automatically detect the high level lineage between the data, signals, and use cases when the user hovers over specific items. The system can also allow a user to drill down further into specific data, signals, and use cases by predefined metadata, which can also allow the user to view the high level lineage.

FIG. 18 is a screenshot illustrating a user interface screen generated by the system for visualizing signal paths using the Knowledge Center 600 generated by the system. As shown, the Signal Hub platform 600 includes a side menu 602 which allows a user to filter signals, such as by entering a search description into a search bar, or by browsing through various categories (e.g., business attribute, window, subject, object, relationship, category, etc.). The Signal Hub platform 600 further includes a main view portion 604. The main view portion 604 diagrammatically displays data sources 606 (e.g., business inputs), descriptive signals 608 (e.g., grouped and organized by metadata), and predictive signals 610. The descriptive signals 608 include a wheel of tabs indicating categories to browse in searching for a particular signal. For example, the categories could include route, flight, hotel, etc. Once a particular category is selected in the descriptive signals 608, the center of the descriptive signals 608 displays information about that particular category. For example, when “route” is chosen, the system indicates to the user that there are 23 related terms, 4 signal sets, and 536 signals.

The Signal Hub platform 600 also displays all the data sources that are fed into the signals of the category chosen. For example, for the “route” category, the data sources include event mater, customer, clickthrough, hierarchy, car destination, ticket coupon, non-flight delivery item, booking master, holiday hotel destination, customer, ancillary master, customer membership, ref table: station pair, table: city word cloud, web session level, ref table: city info, ref table: country code, web master, redemption flight items, email notification, gold guest list, table: station pair info, customer account tcns, service recovery master, etc. A user can then choose one or more of these data sources to further filter the signals (and/or to navigate to those data sources for additional information).

The Signal Hub platform 600 also displays all the models that utilize the signals of the category chosen. For example, for the “route” category, the predictive signals within that category include hotel propensity, destination propensity, pay-for-seat propensity, upgrade propensity, etc. A user can then choose one or more of these predictive signals.

FIG. 19 is a screenshot illustrating a user interface screen generated by the system for visualizing a particular signal using the Knowledge Center 600 generated by the system. As shown, for the particular descriptive signal “bkg_avg_mis_gh_re_v_1y_per_dest” viewed at an individual level, the data sources 606 that feed into that signal include “ancillary master,” “booking master,” and “ref table: station pair,” and the predictive signals that use that descriptive signal include “hotel propensity,” “pay-for-seat-propensity,” and “destination propensity.”

FIG. 20A is a screenshot illustrating a user interface screen generated by the system for finding a signal using the Knowledge Center 600 generated by the system. The main view portion 604 includes a signal table listing all existing signals with summary information (e.g., loaded 100 of 2851 signals) for browsing signals and their related information. The table includes the signal name, signal description, signal tags, signal set, signal type (e.g., Common:Real, Common:Long, etc.), and function. The signal description is an easy to understand business description (e.g., average number of passengers per trip customer travelled with). A user could also conduct a free text search to identify a signal description that contains a specific word (e.g., hotel signals). Further, a metadata filter could identify signals that fit within certain metadata criteria (e.g., signals that calculate an average).

FIG. 20B is a screenshot illustrating a user interface screen generated by the system for finding a signal using the Knowledge Center 600 generated by the system. Users are first asked to select a pre-defined signal subject from the “Search Signal” dropdown list to start the signal search process. The main view portion 604 includes a signal table listing all existing signals with summary information (e.g., filtered conditions applied; loaded 100 of 2851 signals) for browsing signals and their related information. The table includes the signal description, signal type (e.g., Real, Long, etc.), update time, refresh frequency, etc. The signal description is an easy to understand business description (e.g., average number of passengers per trip customer travelled with). A user could also define search columns (e.g., description) and conduct a free text search within the search columns for entries that contain a specific word (e.g., hotel signals). Further, a metadata filter could identify signals that fit within certain metadata criteria as shown in the left side panel (e.g., signals that calculate an average).

FIG. 21A is a screenshot illustrating a user interface screen generated by the system for selecting entries (e.g., customers) with particular signal values using the Knowledge Center 600 generated by the system. Users are also able to apply business rules to signals to filter the data and target subsections of the population. For example, the user may want to identify all customers with a propensity to churn that is greater than 0.7 and those who have had two or more friends churn in the last two weeks. This is particularly important as it enables business users to build sophisticated prescriptive models allowing true democratization of big data analytics across the enterprise. More specifically, a user can select signals to limit the table to only signals necessary to execute the specific use case (e.g., Signal: “cmcnt_trp_oper_led_abdn”). The table 618 also provides for the ability to apply rules to filter the table to include only data that fits within the thresholds (e.g., customers with a hotel propensity score >0.3). For example, the table 618 includes the columns “matched_party_id” 620, “cmcnt_trp_oper_led_abdn” 622, “cmbin_sum_seg_tvl_rev_p1y” 624, “cmavg_mins_dly_p3m” 626, and “SILENT_ATTRITION” 628. A user can narrow the search for a signal by indicating requirements for each column. For example, a user can request to see all signals that have a cmbin_sum_seg_tvl_rev_p1y of =“g. 5000-10000” and a cmavg_mins_dly_p3m of >5. A user can also apply more complex transformations on top of the signals using standard SQL. Further, as shown in FIG. 21B, the Signal Hub platform 600 can schedule the business report on a regular basis (e.g., daily, weekly, monthly, etc.) using a reporting tool 630 to gain recurring insights or export the filtered data to external systems (e.g., CSV file into client's campaign execution engine). The system of the present disclosure can also include a reporting tool implemented in a Hadoop environment. The user can generate a report and query various reports. Further, the user can query a single signal table and view the result in real-time. Still further, the reporting tool can include a query code and a data table fully listed out in the same page so that users are able to switch between different steps easily and view the result for the previous step.
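By way of non-limiting illustration, the column requirements described above can be expressed as a standard SQL query over a signal table. The following Python sketch uses an in-memory SQLite database purely as a stand-in. The table name, the party identifiers, the row values, and the bin label “c. 500-1000” are made up for illustration; the column names and the rule (cmbin_sum_seg_tvl_rev_p1y of “g. 5000-10000” and cmavg_mins_dly_p3m greater than 5) follow the example in the text.

```python
import sqlite3

# In-memory database standing in for a signal table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signals ("
    " matched_party_id TEXT,"
    " cmbin_sum_seg_tvl_rev_p1y TEXT,"
    " cmavg_mins_dly_p3m REAL)"
)
conn.executemany(
    "INSERT INTO signals VALUES (?, ?, ?)",
    [
        ("A001", "g. 5000-10000", 12.5),  # matches both requirements
        ("A002", "g. 5000-10000", 3.0),   # delay signal too low
        ("A003", "c. 500-1000", 20.0),    # wrong revenue bin
    ],
)

# Business rule: one requirement per column, combined with AND.
rule = (
    "SELECT matched_party_id FROM signals"
    " WHERE cmbin_sum_seg_tvl_rev_p1y = ? AND cmavg_mins_dly_p3m > ?"
    " ORDER BY matched_party_id"
)
matched = [row[0] for row in conn.execute(rule, ("g. 5000-10000", 5))]
conn.close()
```

The resulting list of identifiers is the kind of filtered output that could then be exported, for example as a CSV file, to an external system.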

FIG. 21C is a screenshot illustrating a user interface screen generated by the system for displaying a dashboard created using the Knowledge Center 600 generated by the system. A user is able to create various types of graphs (e.g., line chart, pie chart, scattered 3D chart, heat map, etc.) in the Knowledge Center and populate a dashboard with the graphs in a chosen layout. The dashboard is refreshed automatically whenever the backend data is refreshed. A user can also export the dashboard to an external system.

FIG. 21D is a screenshot illustrating a user interface screen generated by the system for exploring a data dictionary created using the Knowledge Center 600 generated by the system. A user is able to learn about all the data input tables used in the solution, with name, description, metadata, columns, and refresh rate information for each data input table. A user can also further explore an individual data input table and learn the meaning of each column in the table. The Signal Hub platform collects and centralizes all the siloed (stored) data knowledge together via the data dictionary and makes it accessible and reusable for all the users.

FIG. 21E is a screenshot illustrating a user interface screen generated by the system for exploring models created using the Knowledge Center 600 generated by the system. A user is able to learn about all the models created in the solution and explore an individual model in depth. The Signal Hub platform can display the model description, metadata, input signal, output column, etc. all in one centralized page for each model. FIG. 21E also illustrates a user interface screen generated by the system for commenting on signals using the Knowledge Center 600 generated by the system. Users can comment on a signal directly via the Knowledge Center user interface to express interest in a signal, propose a potential use case for the signal, or validate the signal value. The Signal Hub platform allows users to interact with each other and exchange ideas.

FIG. 21F is a screenshot generated by the system which illustrates the charts that could be generated by the system. The charts could be a representation of a signal or multiple signals. The types of charts could include, but are not limited to, bar charts, line charts, density charts, pie charts, bar graphs, or any other chart known to those of ordinary skill in the art. Further, as shown, multiple charts could be included in the dashboard for comparing and viewing different charts simultaneously.

FIG. 22 is a screenshot illustrating a user interface screen generated by the system for visualizing signal parts of a signal using the Knowledge Center 600 generated by the system. Shown is a table showing various signals of a signal set. Users can isolate exactly which columns in the raw data or other signals were combined to create the signal of interest. The Signal Hub platform 600 can display the top level diagram 650, the definition level diagram 652, the predecessors 654, raw data 656, consumers 658, definition 660, schema 662, and metadata 664 and stats. The predecessors tab is used to understand the raw data columns and signals that are used to create a specific signal (e.g., txh_mst_rx_cnt_txn_on1) and can be used to track the detailed signal calculation step by step. When the predecessors tab is selected, the resulting table can have one or more columns. For example, the table could include a column 670 of names of the signals within the signal set (e.g., within signal set “signals.signals_pos_txn_mst_04_app”), as well as the formula 672 and where the signal is defined 674.

FIG. 23A is a screenshot illustrating a user interface screen generated by the system for visualizing a lineage of a signal using the Knowledge Center 600 generated by the system. The lineage is used to understand the transformation from raw data to descriptive signals and predictive signals (e.g., how the signal for the number of trips required to move to the next loyalty tier is generated and which models consume it). As shown, when the definition level diagram button 652 is activated, the Signal Hub platform 600 displays the lineage of a particular signal, which includes what data is being pulled and what models the signal is being used in. Once a signal of interest is identified, users can gain a deeper understanding of the signal by exploring its lineage from the raw data through all transformations, providing insight into how a particular signal was created and what the value truly represents. They can identify which signals, if any, consume the signal of interest and view the code that was used to define it.

FIG. 23B is a screenshot illustrating a user interface screen generated by the system for displaying signal value stats and a visualization of the signal value distribution. Both features provide a better understanding of signals, help scientists determine what code needs to be invoked in the production system to calculate the signal, and make signal management easier and faster. The Knowledge Center contains visualization capabilities to allow users to explore the values of signals directly in the Signal Hub platform 600.

FIGS. 24-29 are screenshots illustrating using the Workbench 700 generated by the system to create predictive signals (models) with the Analytic Wizard module. The Analytic Wizard streamlines the model development process with predefined steps and parameter presets. More specifically, FIG. 24A is a screenshot illustrating preparation of data to train a model using the Workbench 700 generated by the system. As shown, the Workbench 700 includes a tree view 702, and an analytic code development window 704 which includes a design tab 708, run tab 710, and results tab 712. The design tab 708 is activated, and within the design tab 708 are a plurality of other tabs. More specifically, the design tab 708 includes a transformations tab 714, a measures tab 716, a models tab 718, a persistence tab 720, a meta tab 722, and a graphs tab 724. Signal Hub offers several ways to split training and test data for model development purposes. The supplementary display portion 706 includes a YAML tab 726, a schema tab 728, and a dependencies tab 730. Signal Hub performs missing value imputation, normalization, and other necessary signal treatment before training the model, as shown in the supplementary display portion 706. Once a model has been selected, more information regarding the model is easily accessible, such as the description and model path. A user can also train an external model using any desired analytic tool. As long as the model output conforms to a standard PMML format, the Signal Hub platform can incorporate the model result and do the scoring later.

FIG. 24B is a screenshot illustrating an alternative embodiment as to how users can select from a variety of model algorithms (e.g., logistic regression, deep autoencoder, etc.). As shown, the Workbench 700 can include a tab 703 for displaying a variety of signals. The Workbench 700 can include a selection means 732 for selecting a model algorithm. The selection means 732 can be a drop down menu or similar means known to those of ordinary skill in the art.

FIG. 24C is a screenshot illustrating the different parameter experiments users can apply during the model training process. Signal Hub also allows a user to configure execution of models with parameter pre-sets that optimize speed or optimize accuracy as execution steps. FIG. 24D is a screenshot illustrating how data preparation can be handled during the model training process. For example, missing values can be replaced with a median value. Furthermore, a normalization method can be applied to the training data. FIG. 24E is a screenshot illustrating how dummy variables can be introduced to facilitate the model training process. FIG. 24F is a screenshot illustrating the dimensional reduction that can be applied to the model training process. For example, a variance threshold can be introduced and the number of dimensions can be specified to further improve the model training accuracy. FIG. 24G is a screenshot illustrating the data splitting aspect of the model training process. For example, a splitting method can be chosen such as cross-fold validation or any other data splitting method known to those of ordinary skill in the art. Furthermore, the number of folds, seed, percent of validation, and the stratified field can be specified. FIG. 24H is a screenshot illustrating the measure tab which allows graph names to be specified along with sampling percentages. The measure tab further allows the corresponding measures to be selected. FIG. 24I is a screenshot illustrating the process tab which allows the user to create a library for the wizard output. In particular, a search path, library, and comments can be input to the system. FIG. 24J is a screenshot of the result tab showing the output of the model training to the user. The foregoing steps of training a predictive model can be done over a Hadoop cluster using dataflow operations.
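By way of non-limiting illustration, the data preparation and data splitting steps described above (replacing missing values with a median, applying a normalization method, and splitting into folds with a seed) might be sketched in Python as follows. The function names and sample values are hypothetical. Min-max scaling is used only as one example of a normalization method, since the present disclosure does not name a specific one, and the fold assignment shown is a simple unstratified split.

```python
import random
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def k_fold_indices(n_rows, n_folds, seed):
    """Shuffle row indices with a fixed seed and return (train, validation)
    index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    return [(sorted(set(indices) - set(fold)), sorted(fold)) for fold in folds]


raw = [4.0, None, 10.0, 6.0, None, 8.0]   # made-up signal values
clean = impute_median(raw)                # median of the known values is 7.0
scaled = min_max_normalize(clean)
splits = k_fold_indices(len(scaled), n_folds=3, seed=42)
```

Fixing the seed makes the fold assignment reproducible, so two runs of the same experiment validate on the same rows.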

FIG. 25A is a screenshot illustrating training a model using the Workbench 700 generated by the system. Signal Hub could include prebuilt models that a user can train (e.g., logistic regression, deep autoencoder, etc.). As shown, the models tab 718 is selected, and a user can add one or more models, such as “binarize,” “decision tree,” “deepAutoencoder,” “externalModel,” “frequentItems,” “gmm,” “kmeans,” “linearRegression,” and “logisticRegression.” A user can train an external model using any desired analytic tool. As long as the model output conforms to a standard PMML format, the Signal Hub platform can incorporate the model result and do the scoring. Under the models tab 718, once a model has been selected, more information regarding the model is easily accessible, such as the description and model path.

FIG. 25B is a screenshot illustrating preparation of data to train a model using the Workbench generated by the system. The Workbench 700 can include a data preparation tab 734. In the data preparation tab 734, Signal Hub can perform missing value imputation 736, normalization 738, and other necessary signal treatment 740 before training the model. FIG. 25C is a screenshot illustrating different data splitting options provided by the Workbench 700. The Workbench 700 can include a data splitting tab 740 for allowing input of the number of folds 741, number of seeds 742, percent of validation 743, and stratified input 744. FIG. 26 is a screenshot illustrating loading an external model trained outside of the integrated development environment.

FIG. 27 is a screenshot illustrating scoring a model using the Workbench 700 generated by the system. Signal Hub includes a number of prebuilt model scorers that can perform end to end analytic development activities. FIG. 28 is a screenshot illustrating monitoring model performance using the Workbench 700 generated by the system. Signal Hub offers various monitoring metrics to measure the model performance (e.g., ROC, KS, Lorenz, etc.). As shown, any of a variety of measures can be used to monitor and score the model. For example, monitoring measures could include “captureRate,” “categoricalWoe,” “conditionIndex,” “confusionMatrix,” “informatonValue,” “kolmogorovSmirnov,” “Lorenz,” “roc,” etc.

FIG. 29A is a screenshot illustrating a solution dependency diagram 750 of the Workbench 700 generated by the system. The diagram 750 illustrates various modules for each portion of the analytics development lifecycle. For example, the diagram illustrates raw data modules 760, aggregate and cleanse data modules 762, create descriptive signals modules 764, select descriptive signal modules 766 (which is also the develop solutions/models module 766), and evaluate model results modules 768.

FIG. 29B is a screenshot illustrating collaborative analytic solution development using the Workbench generated by the system. The system of the present disclosure allows users to collaborate on large software projects for code development. In addition to code development, developers can also develop and collaborate on data assets. Besides the stand-alone development mode, the Signal Hub Workbench can also be connected with a version control system (e.g., SVN) in the backend to support collaborative development. Users can create individual workspaces and submit changes directly from the Workbench user interface. The centralized Workbench also enables users to follow the different activity streams happening in the solution. Files that are being worked on by other developers are automatically shown as locked by the system to avoid conflicts. Locked files become unlocked after the developer submits the changes or a solution manager forcibly breaks the lock, and all developers then automatically receive a workspace update notification. The system of the present disclosure can implement isolation requirements to further facilitate collaboration. For example, the system can isolate upstream code and data changes. If a developer is reading the results of a view or signal set, she expects them not to change without her knowledge. If a change has been made to the view, either because the underlying data has changed or because the code has changed, it should not automatically affect her work until she decides to integrate the updates into her work stream. Additionally, the system can protect a developer's code and data from the other developers' activities. Further, the system can also allow a user to decide when to make their work public. A user has the ability to develop new code without worrying about affecting the work of those downstream. When the work is completed, the user can then “release” her version of the code and data. Other users will see this released version and choose whether they would like to upgrade their view to read it.

The system can further facilitate collaboration by allowing a single library to be developed by a single developer at any one point in time, which reduces code merging issues. Furthermore, the system can use source control to manage code modifications. A user can update when she wants to receive changes from her team members, and commit when she wants them to be able to see her changes. Each developer at a point in time can be responsible for specific views and their data assets. The owner of a view can be responsible for creating new versions of its data, while other developers can only read the data that has been made public to them. Ownership can change between developers or even to a common shared user. A dedicated workspace can be created in the shared cluster which is read-only for other developers, so that only the owner of the workspace can write and update data. When new code and data are developed, the developer can commit the changes to the source control and publish the new data in the cluster to the other developers. This allows the other developers to see the code changes and determine whether they would like to integrate them with their current work.

FIG. 29C is a screenshot of a common environment file that contains code and library output paths to grant every developer access to the code and data of every developer, regardless of where the data resides. The definitions in the file can be referenced with a qualified name instead of a filepath. This allows an easy move from one workspace to another without changing the code, by making only a small change to the common environment file.

FIG. 29D is a screenshot of a separate personal environment file for a user working on a subset of a project. The file begins by inheriting the common project environment file “env_project.yaml.” Thus, all of the parameters set in the common environment file also apply when workflows are run with the personal file, such as “env_myusername.yaml.” Any parameter that is also defined in the personal file, in this case “etlVersion,” overrides the value inherited from the common file. So if the workflows are run with “env_project.yaml,” the “etlVersion” will be 1.4. If the workflows are run with “env_myusername.yaml,” then “etlVersion” will be 1.5. With either environment file, “importVersion” will be 1.1.

FIGS. 29E-29I are screenshots of environment files having multiple output paths. The system can also allow users to have multiple output paths for the data views using the “libraryOutputPaths” parameter in the environment file. These paths can be specified as a map between a library and a file path in which to place the data of that library. For a shared Hadoop cluster, the file path can point to a folder on HDFS. The default data location can still be decided using the “dataOutputPath” if the library is not mapped to any new location. Using this map, each library can be assigned to a unique data location. The project owner can therefore map each library to a directory that is owned by a given developer. This can further allow data view abstraction modes for maintaining fast incremental data updates without underlying filesystem support for the data update.
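By way of non-limiting illustration, the inheritance and output path rules described above might be sketched in Python as follows, with each environment file modeled as a dictionary. The “inherits” key, the directory paths, and the library names are assumptions made for this sketch, since the actual syntax of the environment files is shown only in the figures. The file names, the “etlVersion” and “importVersion” values, and the “libraryOutputPaths” and “dataOutputPath” parameters follow the text.

```python
def resolve_env(env_name, env_files):
    """Merge an environment file with the file it inherits from.
    Parent values are loaded first; values in the child override them."""
    env = env_files[env_name]
    parent = env.get("inherits")  # hypothetical key naming the inherited file
    merged = resolve_env(parent, env_files) if parent else {}
    for key, value in env.items():
        if key == "inherits":
            continue
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = {**merged[key], **value}  # merge maps entry by entry
        else:
            merged[key] = value
    return merged


def output_path(library, env):
    """Use the library's mapped path if present, else the default data path."""
    return env.get("libraryOutputPaths", {}).get(library, env["dataOutputPath"])


# Dictionaries standing in for the parsed YAML files; paths are made up.
env_files = {
    "env_project.yaml": {
        "etlVersion": "1.4",
        "importVersion": "1.1",
        "dataOutputPath": "/data/project",
        "libraryOutputPaths": {"etl": "/data/alice/etl"},
    },
    "env_myusername.yaml": {
        "inherits": "env_project.yaml",
        "etlVersion": "1.5",
        "libraryOutputPaths": {"signals": "/data/myusername/signals"},
    },
}

project = resolve_env("env_project.yaml", env_files)
personal = resolve_env("env_myusername.yaml", env_files)
```

Under these assumptions, running with the personal file changes only the overridden parameters, and a library that has no entry in the map falls back to the default data location.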

FIG. 29J is a screenshot of code for data versioning. Data versioning is the ability to store different generations of data and to allow other collaborators to decide which version to use. To achieve this, users can version their data using the label properties of views. There are two ways of doing this: one in the view itself, the other in the common environment file. The code is shown in FIG. 29J. Every time the view is executed, the version of the view data can be determined by the label. If a new label is used, a new folder can be created with a new version of the view's data. The granularity of this versioning is up to the user; she can choose to assign the version number to just one view or to some subset, depending on the needs of the project. Every time the user wants to publish a new version of “myView” to her team members, the user can increment “myView_LatestVersion” in the common environment file. This change can indicate either a code update or a data update. Additionally, the user may add a comment to the environment file giving information about this latest version, including when it was updated, what the changes were, etc. The user can then commit the common environment file with the rest of her code changes. With this information, users of the view further downstream can choose whether they would like to upgrade to the latest version or continue using an earlier version. If the downstream users always want to get the latest version, they can use the same variable “myView_LatestVersion” in the label parameter of the “readView” for “myView.” Since they share the same common environment file, the latest value will be used when a user updates her code from the source control system. If the user wants to stay with an existing version, the user can override the version in their private environment file with a specific version. Once a version is “released,” the permissions on that directory can be changed to make it non-writable even for the developer herself, so that it is not accidentally overwritten. This can allow users to set different version numbers and “libraryOutputPaths.” For example, the project-level environment file (the one users are using by default) can have the latest release version for a given view. The user developing it can have a private environment file with a later version. The user can do this by including the same version parameter in her file and running the view with her private environment file. This can allow the user to develop new versions while others are reading the older stable version.
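By way of non-limiting illustration, the label-based version selection described above might be sketched in Python as follows. The folder layout (one “v” folder per label under the view's directory), the base path, and the function names are assumptions made for this sketch. Only the variable name “myView_LatestVersion” and the rule that a private environment file can override the common one come from the text.

```python
import posixpath


def effective_label(variable, common_env, private_env=None):
    """Return the version label to use: a private environment file
    overrides the value in the common environment file."""
    if private_env and variable in private_env:
        return private_env[variable]
    return common_env[variable]


def view_data_path(base_path, view, label):
    """Each label maps to its own folder, so versions never overwrite
    each other (hypothetical layout: <base>/<view>/v<label>)."""
    return posixpath.join(base_path, view, "v" + str(label))


common_env = {"myView_LatestVersion": 3}   # latest released version
pinned_env = {"myView_LatestVersion": 2}   # downstream user staying on v2
develop_env = {"myView_LatestVersion": 4}  # owner developing the next version

latest = view_data_path(
    "/data/project", "myView",
    effective_label("myView_LatestVersion", common_env))
pinned = view_data_path(
    "/data/project", "myView",
    effective_label("myView_LatestVersion", common_env, pinned_env))
in_progress = view_data_path(
    "/data/project", "myView",
    effective_label("myView_LatestVersion", common_env, develop_env))
```

Because the three readers resolve to three different folders, the owner can write the next version while others keep reading a released one.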

In most cases, users can “own” a piece of code, either independently or as a team. They can be responsible for updating and testing the code, upgrading the inputs to their code, and releasing versions to be consumed by other users downstream. Thus, if the team maintaining a given set of code needs an input upgraded, they can contact the team responsible for that input and request the relevant changes and a new release. If the upstream team is able to help, they can make the release. If the upstream team is not able to help, the user can change the “libraryOutputPaths” for the necessary code to a directory in which they have permissions. This requires no code changes beyond the small change in the environment file. This allows collaboration with minimum disruption.

FIGS. 30-32 are screenshots illustrating the Signal Hub manager 800 generated by the system to manage user access to the overall Signal Hub platform and the analytic operation process. The Signal Hub manager 800 provides a management and monitoring console for analytic operational stewards (e.g., IT, business, science, etc.). The Signal Hub manager 800 facilitates understanding and managing the production quality and computing resources.

FIG. 30A is a screenshot of the Signal Hub manager 800 generated by the system. The Signal Hub manager 800 facilitates easy viewing and management of signals, signal sets, and models. The management console allows for the creation of custom dashboards and charting, and the ability to drill into real time data and real time charting for a continuous process. As shown, the Signal Hub manager 800 includes a diagram view. In this view, the Signal Hub manager 800 could include a data flow diagram 802 showing the general data flow of raw data to signals to models. Further, the Signal Hub manager 800 could include a chart area 804 providing a variety of information about the data, signals, signal sets, and models. For example, the chart area 804 could provide one or more tabs related to performance, invocation history, data result, and configuration. The data result tab could include information such as data, data quality, measure, PMML, and graphs. The Signal Hub manager 800 could also include additional information as illustrated in window 806, such as performance charts and heat maps. The chart area allows a user to drill down on every workflow to easily understand the processing of all views involved in the execution of a use case.

FIG. 30B is a screenshot for user access management of the Signal Hub manager 800 generated by the system. The Signal Hub manager 800 provides role-based access control for all Signal Hub platform components to increase network security in an efficient and reliable way. As shown, users are assigned to different groups, and different groups are authorized with different permissions including admin, access, operate, develop, and email. Besides global permission management, the Signal Hub platform also allows an admin user to manage authentication and authorization on a per-solution basis.

FIG. 30C is a screenshot for overall Signal Hub platform usage tracking of the Signal Hub manager 800 generated by the system. As shown, a user is able to download the usage report from the Signal Hub manager user interface to track how other users are using different Signal Hub platform components by detailed event (e.g., login, entering Knowledge Center, creating a report, creating a dashboard, etc.) and conduct further analysis on top of it.

FIGS. 31A-B are screenshots for the alerts system of the Signal Hub manager 800 generated by the system. Based on monitored system stats, a user can set up alerts at different levels including system level alerts, workflow level alerts, and view level alerts. The Signal Hub platform also allows the user to set up different types of alerts (e.g., resource usage, execution time, signal value drift, etc.), define thresholds, and trigger recovery behaviors (e.g., email notification, fail job, roll back job) automatically. The alert feature enables users to better track solution status from both operational and analytic perspectives and greatly improves solution operation efficiency. FIG. 31A is another screenshot of the Signal Hub manager 800 generated by the system. The Signal Hub manager 800 includes a table view. In this view, the Signal Hub manager includes a data flow table of information regarding the general data flow of raw data to signals to models. The data flow table includes view name, label, status, last run, invocation number (e.g., success number, failure number), data quality (e.g., treated number, rejected number), timestamp of last failure, current wait time, average wait time, average rows per second, average time to completion, update (e.g., input record number, output record number), historical (e.g., input record number, output record number), etc. Similar to the diagram view discussed above, the table view could also include a chart area. For example, the chart area 804 could provide one or more tabs related to performance, invocation history, data result, and configuration. The invocation history tab could include invocation, status, result, elapsed time, wait time, rows per second, time to completion, update (e.g., input record number, output record number), and historical (e.g., input record number, output record number). FIG. 31B is a screenshot illustrating overall Signal Hub platform usage tracking of the Signal Hub manager 800 and alert functionality generated by the system. As shown, a user is able to download the usage report from the Signal Hub manager user interface to track how other users are using different Signal Hub platform components by detailed event (e.g., login, entering Knowledge Center, creating a report, creating a dashboard, etc.) and conduct further analysis on top of it.

FIG. 32 is another screenshot of the Signal Hub manager 800 generated by the system. More specifically, shown is the monitor system of the Signal Hub manager 800. This facilitates easy monitoring of all analytic processes from a single dashboard. The current activities window 810 has a table which includes solution names, workflow names, status, last run, success number, failure number, timestamp of last failure, and average elapsed time. The top storage consumers window 812 has a table which includes solution names, views, volume, last read, last write, number of variants, and number of labels. The top run time consumers window 814 has a table which includes solution names, views, run time, number parallel, elapsed time, requested memory, and number of containers. A user is also able to drill down to a specific solution, workflow, or view to learn about its operational status.

FIG. 33 is a diagram showing hardware and software components of the system 100. The system 100 comprises a processing server 102 which could include a storage device 104, a network interface 108, a communications bus 110, a central processing unit (CPU) (microprocessor) 112, a random access memory (RAM) 114, and one or more input devices 116, such as a keyboard, mouse, etc. The server 102 could also include a display (e.g., liquid crystal display (LCD), cathode ray tube (CRT), etc.). The storage device 104 could comprise any suitable, computer-readable storage medium such as disk, non-volatile memory (e.g., read-only memory (ROM), erasable programmable ROM (EPROM), electrically-erasable programmable ROM (EEPROM), flash memory, field-programmable gate array (FPGA), etc.). The server 102 could be a networked computer system, a personal computer, a smart phone, a tablet computer, etc. It is noted that the server 102 need not be a networked server, and indeed, could be a stand-alone computer system.

The functionality provided by the present disclosure could be provided by a Signal Hub program/engine 106, which could be embodied as computer-readable program code stored on the storage device 104 and executed by the CPU 112 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 108 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 102 to communicate via the network. The CPU 112 could include any suitable single- or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the Signal Hub program 106 (e.g., Intel processor). The random access memory 114 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.

In certain exemplary embodiments, an interactive reporting tool allows the user to define, modify, and selectively execute a sequence of queries (which also may be referred to as query steps of an overall query) in an interactive manner. Generally speaking, such an interactive reporting tool provides user interface screens including an interactive query code table into which the user can configure a sequence of queries and also including a data table in which the results of a particular query or sequence of queries are displayed. The interactive query code table may be configured to include a plurality of rows, with each row representing a query. In certain exemplary embodiments, the interactive query code table may be configured to include a plurality of cells organized into rows and columns, with each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), and with each row representing a query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the query. Thus, an interactive query code table generally includes n rows representing a sequence of n queries identifiable as queries 1 through n. When the user selects a given row i or a cell in a given row i (where i is between 1 and n, inclusive), then the data table is updated to show the results of the queries 1 through i, such that the user effectively can step through the queries in any order (i.e., sequential or non-sequential, forward or backward) to see the results of each step. Furthermore, if the user changes the query in a given row i (e.g., by adding a new query parameter or changing an existing query parameter in a column/cell of the row), then the prior results from the queries associated with rows i through n may be invalidated, and at least the changed query (and typically all of the queries 1 through i) is executed in order to update the data table to include results from the changed query. Embodiments may cache the results of various queries so that unchanged queries need not be re-executed. Embodiments additionally or alternatively may place a temporary lock on a set of queries in the interactive query code table so as to limit the types of changes that can be made to the interactive query code table, e.g., allowing only the addition of new queries while the temporary lock is in place. The lock can be enforced, for example, by prompting the user upon receiving a user input that would change an existing query in the interactive query code table.

FIG. 34 is a screenshot illustrating a sample interactive reporting user interface screen 3400 generated by Signal Hub platform 600, in accordance with an exemplary embodiment. In the screenshot, a user is able to create step-by-step queries on signals in an interactive query code table 3404 and view the initial results on sample data in real time in a data table 3406. In this example, the interactive query code table 3404 and the resulting data table 3406 are included in the same user interface screen 3400, although it should be noted that the two tables could be provided in separate user interface screens in various alternative embodiments.

In the example shown in FIG. 34, the interactive query code table 3404 includes two queries 3408 and 3410. Query 3408 effectively identifies all customers with a signal value of “SPM2” greater than 0.1 and with a signal value of “cmdist_homeipr_A” greater than 5. Query 3410 effectively aggregates those customers by their “WALLET_SIZE_BUCKET,” gets the average for the “cmdist_homeipr_A” signal, and counts how many “CUSTOMER_ID” values there are in each bucket. The user has selected cell 3412 in query 3408, as indicated by the highlighting of the cell 3412 and the representation of the query parameter of the selected cell 3412 in command line 3402. The data table 3406 shows the results from query 3408 rather than the results of query 3410.

FIG. 35 is a screenshot illustrating a sample interactive reporting user interface screen 3500 generated by Signal Hub platform 600, in accordance with an exemplary embodiment. Here, the user has selected cell 3512 in query 3410, as indicated by the highlighting of the cell 3512 and the representation of the query parameter of the selected cell 3512 in command line 3502. The data table 3506 shows the results from query 3410 rather than the results of query 3408.

With reference again to FIG. 34, assume that the user has selected cell 3414 in query 3408 and changed the value from 0.1 to 0.11. In this case, row 3416 would be omitted from data table 3406 because the signal value of SPM2 would now be less than the filter value of 0.11, i.e., the customer with Customer ID 768 now would not meet the filtering criteria of modified query 3408. FIG. 36 shows a representation of the screenshot of FIG. 34 with the SPM2 signal value changed to 0.11 in query 3408 and row 3416 struck from data table 3406.

If the user then selected cell 3512 in query 3410 as in FIG. 35, the query 3410 would be executed based on the results obtained from the modified query 3408 and hence would omit the data associated with data table row 3416. FIG. 37 shows a representation of the screenshot of FIG. 35 based on the changes reflected in FIG. 36, i.e., the count of customers having a WALLET_SIZE_BUCKET of 4000-8000 has decreased by one relative to the data table shown in FIG. 35 (the value of the average for the “cmdist_homeipr_A” signal likely would change, too, although for convenience the value shown in FIG. 37 is the same as the value shown in FIG. 35).

Thus, as discussed above, the user can switch between different query steps by moving the mouse cursor and view the result for corresponding step(s). For example, in FIG. 34 the user made a selection in query 3408 and was able to view the results of that query without the further processing of query 3410, which was already configured in the interactive query code table. The user then could make a selection in query 3410 to view the results of queries 3408 and 3410, discussed above with reference to FIG. 35. The user could then make a change in query 3408 and view the results of that modified query, as discussed above with reference to FIG. 36. The user could then make a selection in query 3410 to review the results of modified query 3408 and query 3410.

FIG. 38 is a flowchart of a computer-implemented method for interactive database reporting, in accordance with one exemplary embodiment. In block 3802, a first graphical user interface screen is displayed on a display screen of a computer of a user. The first graphical user interface screen includes an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one. In block 3804, a first user input associated with a given row i of the interactive query code table is received. In block 3806, a first data table including results from execution of queries 1 through i is displayed, such that the user is able to select any given database query in the interactive query code table to view results through the given database query. In block 3808, a second user input making a change to the database query associated with the given row is received. In block 3810, prior results from the database queries associated with rows i through n are optionally invalidated. In block 3812, at least the changed database query associated with the given row is executed, although any number of queries associated with rows 1 through i may be executed. In block 3814, a second data table including the results from execution of the changed database query is displayed.

In order to allow for such interactive database reporting in a dynamic system in which query parameters (signals) may change from time to time, the Signal Hub platform 600 may store temporary copies of query parameters (signals) that are used in the queries so that the results can be replicated and/or manipulated using a baseline of query parameters (signals). Also, results from one or more of the queries may be cached so that, for example, when the user steps from one row/query to another row/query, the resulting data table can be produced quickly from cached data. It should be noted that the interactive database reporting tool can be implemented using virtually any database or query processing engine and therefore is not limited to any particular database or query processing engine.
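By way of illustration only, the row-selection, caching, and invalidation behavior described above can be sketched in Python as follows. The class name InteractiveQueryTable, its methods, the helper functions, and the sample rows are hypothetical and do not appear in the embodiments; queries are modeled as plain functions over in-memory rows rather than as database queries.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Query = Callable[[List[Row]], List[Row]]


class InteractiveQueryTable:
    """Hypothetical model of an interactive query code table with cached results."""

    def __init__(self, source: List[Row], queries: List[Query]) -> None:
        self.source = source
        self.queries = list(queries)
        # cache[i] holds the data table produced by queries 1 through i.
        self.cache: Dict[int, List[Row]] = {}
        self.executions = 0  # number of query executions, for illustration

    def select(self, i: int) -> List[Row]:
        """Return the results of queries 1 through i, reusing cached results."""
        if not 1 <= i <= len(self.queries):
            raise IndexError("row index out of range")
        # Start from the longest cached prefix that does not go past row i.
        start = max((k for k in self.cache if k <= i), default=0)
        data = self.cache[start] if start else self.source
        for k in range(start + 1, i + 1):
            data = self.queries[k - 1](data)
            self.executions += 1
            self.cache[k] = data
        return data

    def change(self, i: int, query: Query) -> None:
        """Replace the query in row i and invalidate results for rows i through n."""
        self.queries[i - 1] = query
        for k in [k for k in self.cache if k >= i]:
            del self.cache[k]


def spm2_greater_than(threshold: float) -> Query:
    return lambda rows: [r for r in rows if r["SPM2"] > threshold]


def count_by_bucket(rows: List[Row]) -> List[Row]:
    counts: Dict[Any, int] = {}
    for r in rows:
        counts[r["WALLET_SIZE_BUCKET"]] = counts.get(r["WALLET_SIZE_BUCKET"], 0) + 1
    return [{"WALLET_SIZE_BUCKET": b, "count": c} for b, c in counts.items()]


# Made-up sample rows, for illustration only.
customers = [
    {"CUSTOMER_ID": 768, "SPM2": 0.105, "WALLET_SIZE_BUCKET": "4000-8000"},
    {"CUSTOMER_ID": 769, "SPM2": 0.2, "WALLET_SIZE_BUCKET": "4000-8000"},
    {"CUSTOMER_ID": 770, "SPM2": 0.05, "WALLET_SIZE_BUCKET": "0-4000"},
]

table = InteractiveQueryTable(customers, [spm2_greater_than(0.1), count_by_bucket])
both_steps = table.select(2)    # executes queries 1 and 2
first_step = table.select(1)    # served from the cache, nothing re-executed
table.change(1, spm2_greater_than(0.11))
after_change = table.select(2)  # queries 1 and 2 are executed again
```

In this sketch, changing row 1 discards the cached results for rows 1 and 2, so the next selection of row 2 re-executes both queries, whereas stepping back to row 1 before the change costs no execution at all.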

In any database query system, including embodiments described above, queries are often executed against big data sets, and each “pass” made over the data set to execute such queries can take many hours to complete depending on the data set size and the number of types of queries. It is helpful, then, to minimize the number of “passes” made over the data set when executing the queries in order to optimize query execution.

Therefore, in certain exemplary embodiments, a sequence of queries is divided into stages, where each stage involves one pass over the data, such that the sequence of queries can be executed using the minimum number of passes over the data. Specifically, in certain exemplary embodiments, the sequence of queries is processed into a functional dependency graph that represents the relationships between query parameters (signals) and query operations, and the functional dependency graph is then processed to divide the queries into a number of successive stages such that each stage includes queries that can be executed based on data that exists prior to execution of that stage. A sequence of queries may, and often does, require that one or more intermediate values or datasets be generated using an aggregate function (i.e., a function that needs to pass over the entire dataset in order to compute a value) for use or consumption by another query. A query that produces a given intermediate result must be executed at a stage that is prior to execution of any query that consumes the intermediate result. It is possible for multiple intermediate results to be created at a given stage.

By way of example, the following set of expressions involves generation and consumption of an intermediate value when converted into database queries:

sum_discount=sum(discount);

avg_amt=avg(amt);

ct_large_amt=count( ) when amt>avg_amt;

In this example, the value “avg_amt” (average amount) needs to be computed before a count can be made of the number of records having an “amt” (amount) that is greater than the average amount “avg_amt”. To express this in SQL, the developer would have to plan this in two stages.
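The two stages can be pictured with a minimal Python sketch, offered for illustration only. The function name staged_aggregates and the sample records are hypothetical, and the sketch assumes a non-empty, in-memory list of records rather than a database table.

```python
from typing import Dict, List


def staged_aggregates(records: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute the three example expressions in two passes over the records."""
    # Stage 1 (first pass): aggregates that depend only on input columns.
    sum_discount = 0.0
    total_amt = 0.0
    for r in records:
        sum_discount += r["discount"]
        total_amt += r["amt"]
    avg_amt = total_amt / len(records)  # assumes at least one record

    # Stage 2 (second pass): the count consumes avg_amt from stage 1.
    ct_large_amt = sum(1 for r in records if r["amt"] > avg_amt)

    return {
        "sum_discount": sum_discount,
        "avg_amt": avg_amt,
        "ct_large_amt": ct_large_amt,
    }


# Made-up sample records, for illustration only.
sample = [
    {"amt": 10.0, "discount": 1.0},
    {"amt": 20.0, "discount": 0.0},
    {"amt": 60.0, "discount": 5.0},
]
result = staged_aggregates(sample)
```

The second pass cannot begin until the first has finished, because every record must be compared against an average that is only known once all records have been read.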

As the number of expressions increases, it becomes more difficult for the developer to form the necessary nested queries. For example, the following set of expressions involves generation and consumption of two intermediate values when converted into database queries:

sum_discount=sum(discount);

avg_amt=avg(amt);

ct_large_amt=count( ) when amt>avg_amt;

ct_prod_with_large_amt=count_uniq(prod_id) when ct_large_amt>=10;

In this example, ct_large_amt is an aggregate function that cannot be executed until avg_amt is determined, and ct_prod_with_large_amt is an aggregate function that cannot be executed until ct_large_amt is determined. To express this in SQL, the developer would have to plan this in three stages.

Therefore, in certain exemplary embodiments, a functional dependency graph is first produced from the set of expressions. FIG. 39 is a schematic diagram showing a functional dependency graph for the above example, in accordance with one exemplary embodiment.

The functional dependency graph is then traversed using a specially-modified breadth-first search traversal to assign each expression to an execution stage. Breadth-first search traversal of graphs/trees is well-known for tracking each node in the graph/tree that is reached and outputting a list of nodes in breadth-first order, i.e., outputting all nodes at a given level of the graph/tree before outputting any nodes at the next level down in the graph/tree. In exemplary embodiments, when evaluating nodes at a given level of the graph/tree associated with a given execution stage, the specially-modified breadth-first search traversal will not associate a particular lower-level node with the given execution stage if that node is associated with an aggregate function. Instead, the lower-level node associated with an aggregate function is placed in a later execution stage in accordance with the specially-modified breadth-first search traversal. The following is an algorithmic description for assigning each expression to an execution stage based on the functional dependency graph, in accordance with one exemplary embodiment:

1. Make each input column an input node in the graph.
2. For each given expression, transform the expression into the functional dependency graph and connect it to the inputs or other expressions.
3. Set PhaseID to 0.
4. Perform the specially-modified breadth-first search (BFS) graph traversal in accordance with Steps 5 through 21.
5. Breadth-First-Search - Aggregates Expressions (Graph, root):
6.
7.     create empty set S
8.     create empty queue Q
9.
10.     add root to S
11.     Q.enqueue(root)
12.
13.     while Q is not empty:
14.         current=Q.dequeue( )
15.         if current is the goal:
16.             return current
17.         for each node n that is adjacent to current:
18.             if n is not in S AND n is not Aggregate function:
19.                 add n to S
20.                 n.parent=current
21.                 Q.enqueue(n)
22. Add emitted nodes to the current phase.
23. Remove nodes emitted in Step 22 from the graph, increase PhaseID by 1, and repeat Step 4 until no more nodes remain in the graph.

Steps 5 through 21 listed above are based on a standard breadth-first search traversal methodology published in WIKIPEDIA but with the addition of the conditional “AND n is not Aggregate function” in order to omit lower-level aggregate functions from being included in a given execution stage. The goal of the standard BFS traversal methodology is to queue/output the nodes from the graph in breadth-first order, i.e., the root, followed by all nodes one layer down from the root, followed by all nodes two layers down from the root, and so on. Literally applying this specially-modified BFS traversal methodology to the graph shown in FIG. 39 could result in more than three stages with some stages requiring no execution (e.g., the root node and certain operator nodes require no execution in and of themselves). In essence, then, the specially-modified BFS traversal methodology would be applied to the nodes associated with executable expressions, in which case three execution stages would be generated, as follows:

Stage    Expressions
1        sum_discount = sum(discount); avg_amt = avg(amt);
2        ct_large_amt = count( ) when amt > avg_amt;
3        ct_prod_with_large_amt = count_uniq(prod_id) when ct_large_amt >= 10

Thus, this particular specially-modified BFS traversal methodology is presented for example only, and implementations may use a variation of this methodology or other, suitably-modified, BFS traversal methodology.
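As one such variation, and purely as an illustrative sketch, the three stages listed above can be reproduced with a simple dependency-depth rule instead of a BFS traversal: an aggregate expression is placed one stage after the latest expression it consumes, and raw input columns count as stage 0. The function name assign_stages and the dictionary layout are hypothetical, and the sketch assumes that every expression is an aggregate, as in the example; scalar expressions would need an additional rule.

```python
from collections import defaultdict
from typing import Dict, List, Set


def assign_stages(deps: Dict[str, Set[str]]) -> Dict[int, List[str]]:
    """Map stage number -> expression names.

    deps maps each expression name to the names it reads; any name that is
    not itself a key of deps is treated as a raw input column (stage 0).
    """
    stage: Dict[str, int] = {}
    visiting: Set[str] = set()

    def stage_of(name: str) -> int:
        if name not in deps:
            return 0  # raw input column, available before any pass
        if name in stage:
            return stage[name]
        if name in visiting:
            raise ValueError(f"cyclic dependency involving {name!r}")
        visiting.add(name)
        # One stage later than the latest dependency.
        stage[name] = 1 + max((stage_of(d) for d in deps[name]), default=0)
        visiting.discard(name)
        return stage[name]

    stages: Dict[int, List[str]] = defaultdict(list)
    for name in deps:
        stages[stage_of(name)].append(name)
    return dict(sorted(stages.items()))


# Dependencies of the four example expressions.
example = {
    "sum_discount": {"discount"},
    "avg_amt": {"amt"},
    "ct_large_amt": {"amt", "avg_amt"},
    "ct_prod_with_large_amt": {"prod_id", "ct_large_amt"},
}
stages = assign_stages(example)
```

Because each expression lands in the earliest stage at which all of its inputs exist, the number of stages equals the length of the longest dependency chain, which is three for the example.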

Once the expressions are assigned to different stages, database queries are automatically generated in accordance with known methods so that the expressions can be executed in accordance with the stages. The following is example SQL code for executing the staged expressions from the example above:

Select count(prod_id) as ct_prod_with_large_amt
From T1
Join (Select Count(amt > avg_amt) as ct_large_amt, sum_discount
      From T1
      Join (Select Day, avg(amt) as avg_amt, sum(discount) as sum_discount
            From Input
            Group by Day) T_AVG
        on T_AVG.Day = T1.Day
      Group by Day) T_CNT
  on T_CNT.Day = T1.Day
Where ct_large_amt >= 10
Group By Day

Here, each phase is a single flat query with the same grouping key as the original grouping. The result of each phase is joined with the inputs using the grouping key, and made available for the next phase. Any relational engine can now execute these queries in a pipeline using the relational schema illustrated by the SQL syntax above. It should be noted that this algorithm is applicable not only to databases that support SQL, but also to any relational engine or data flow engine such as Apache Spark and Apache Tez.

While the example above uses simple aggregate functions such as sum, count, and count_uniq, and simple scalar functions such as “>=” and “>”, this schema can be easily expanded to any aggregate and scalar functions, where aggregate functions are functions that need to pass over the entire data in order to compute a value and scalar functions can be computed on each record. Using this designation, functions can be categorized appropriately and the schema can work on any set of expressions. Thus, starting with a simple non-nested set of expressions that have inter-dependency, a relational nested representation of the computation is produced.

FIG. 40 is a flowchart for query execution optimization in accordance with one exemplary embodiment. In block 4010, a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions is produced. In block 4020, each scalar expression is assigned to one of a plurality of successive execution stages based on the functional dependency graph. In block 4030, the scalar expressions at each stage are converted into one or more relational database queries such that each execution stage involves at most one pass through the dataset.

In certain exemplary embodiments, the functional dependency graph is produced by processing each expression as follows. An output parameter and a set of input parameters associated with the scalar expression are identified, as in block 4012. A node for the output parameter is created in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph, as in block 4014. For each input parameter in the set of input parameters, a distinct node is created in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph, as in block 4016. An association is established in the functional dependency graph between the output parameter node and each of the distinct input parameter nodes, as in block 4018.
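A minimal Python sketch of blocks 4012 through 4018 is given below for illustration. The function name build_dependency_graph is hypothetical, and the way parameters are recognized (identifiers that are not followed by an opening parenthesis and are not the keyword “when”) is an assumption made only so that the example expressions above can be parsed; it is not a description of an actual expression parser.

```python
import re
from typing import Dict, Iterable, Set, Tuple

KEYWORDS = {"when"}  # assumed keyword set for the example expressions


def build_dependency_graph(
    expressions: Iterable[str],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return (nodes, edges) where edges maps an output node to its input nodes."""
    nodes: Set[str] = set()
    edges: Dict[str, Set[str]] = {}
    for text in expressions:
        # Block 4012: identify the output parameter and the input parameters.
        output, body = (part.strip() for part in text.rstrip(";").split("=", 1))
        inputs: Set[str] = set()
        for match in re.finditer(r"[A-Za-z_]\w*", body):
            rest = body[match.end():].lstrip()
            if rest.startswith("(") or match.group() in KEYWORDS:
                continue  # function name or keyword, not a parameter
            inputs.add(match.group())
        # Block 4014: create the output node if it does not exist yet.
        nodes.add(output)
        # Block 4016: create a distinct node per input if it does not exist yet.
        nodes.update(inputs)
        # Block 4018: associate the output node with each input node.
        edges.setdefault(output, set()).update(inputs)
    return nodes, edges


nodes, edges = build_dependency_graph([
    "sum_discount=sum(discount);",
    "avg_amt=avg(amt);",
    "ct_large_amt=count( ) when amt>avg_amt;",
    "ct_prod_with_large_amt=count_uniq(prod_id) when ct_large_amt>=10;",
])
```

Using sets for the nodes gives the “create only if absent” behavior of blocks 4014 and 4016 without an explicit existence check.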

In certain specific exemplary embodiments, the Signal Hub server 600 may be configured to divide a sequence of queries into stages such that the sequence of queries can be executed using the minimum number of passes over the data set. For example, the Workbench discussed herein may provide user interface screens through which the user can enter a sequence of expressions that are then divided into stages and converted into database queries (e.g., SQL queries), as discussed above. Additionally or alternatively, the Signal Hub server 600 may apply query execution optimization as discussed herein to any set of queries, such as, for example, queries executed as part of the interactive reporting tool discussed above.

As discussed above, e.g., with reference to FIGS. 29B-29J, in a collaborative data processing environment such as the Signal Hub system 600, datasets are updated by two main activities, namely production and development. Production activity is handled by workflows that automate the processing of the data, ingest the new incoming data, and update the datasets. Development activity includes developers making code changes, creating new datasets, and updating existing datasets. In these environments, datasets are often derived from other datasets, which can be described by a dependency graph like the example shown in FIG. 41, where datasets V3 and V4 are derived from dataset V2, which in turn is derived from dataset V1. Furthermore, development and production generally share datasets. Consequently, code and data versioning may be used to allow developers to work on code and data without interfering with one another or with production code and data, which should not be affected by development activities.

For example, with reference again to FIG. 41, Developer A might want to see only the V2 production dataset that has been through quality assurance (QA) and deployed to the system, while Developer B might want to change the V2 production dataset to test some code changes and continue developing the V4 dataset on changes applied to the V2 dataset data (although without changing the V2 production dataset itself).

In typical coding environments, a source control system is used to allow developers to make code changes without affecting each other's work and without affecting the production code, until changes are ready to be committed for production. Source control systems allow each developer to work on a separate version of the code, but when development is done and the changes can be shared with others, the code can be committed into the shared version. However, in data processing environments, while source control systems can be used to allow for source control, versioning, and collaboration on the code, large datasets cannot be versioned efficiently via the source control system. In addition, because of the dependency graph shown in FIG. 41, it is challenging to maintain which version of code is associated with each dataset.

To resolve these challenges, certain exemplary embodiments provide code and data versioning based on the concept of “views” and “workflows,” where a view is defined by relational processing logic (e.g., a query) that consumes data from other views or data sources and persists the processed data into files (e.g., on the append-only file system), and a workflow is a directed graph of views where the nodes are views and the edges between views represent that one view reads from another view (e.g., an edge from a view v1 to a later view v2 describes that v2 reads from v1). A workflow engine can traverse a workflow graph and execute each view in topological order.
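As a minimal sketch of such a traversal, the following Python fragment orders the views of the FIG. 41 dependency graph (V2 reads from V1; V3 and V4 read from V2) so that every view runs after the views it reads from. The function name execution_order is hypothetical; the sketch relies on the standard-library graphlib module, which is available in Python 3.9 and later, and it does not model the execution of the views themselves.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(reads_from: Dict[str, Set[str]]) -> List[str]:
    """Return the views in an order where each view follows the views it reads from.

    reads_from maps each view to the set of views it reads from.
    """
    return list(TopologicalSorter(reads_from).static_order())


# Workflow graph corresponding to the FIG. 41 example.
workflow = {"V2": {"V1"}, "V3": {"V2"}, "V4": {"V2"}}
order = execution_order(workflow)
```

The relative order of V3 and V4 is not fixed by the graph, since neither reads from the other; a workflow engine would be free to run them in either order or in parallel.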

The code and data versioning method can be summarized as follows:

1. The source code of the views is managed by a source control system.
2. Each developer has a separate workspace that contains a snapshot of the code from the source control system.
3. Production workflows will share the same workspace.
4. When a developer makes a code change, the code change will have to be committed before other workspaces can be updated (via a separate action) to reflect the code change.
5. Before a developer can make a code change, a lock is placed on the view, allowing only a single developer to work on a single dataset.
6. From the moment a developer makes a code change until the time the code changes are committed, the developer is associated with a Task that describes the activity for which the code changes are made.

These steps involve standard source control operations that are supported in one way or another by various source control systems (e.g., SVN, Git), such as Start Task, Finish Task, and Update Workspace. When a user starts a new task, any change to the code will invoke a lock on the code file with the changes and prevent other users from changing this code file. Once a user finishes a task, all code is committed into the shared version and all locks are released. The Update Workspace operation updates the developer workspace files from the shared version.
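The locking behavior of the Start Task and Finish Task operations can be pictured with the following Python sketch, which is illustrative only. The class name LockTable, its methods, and the file names are hypothetical; the sketch keeps the locks in memory and does not model the commit itself or any particular source control system.

```python
from typing import Dict, List


class LockTable:
    """Hypothetical in-memory record of which task holds the lock on each code file."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}  # code file -> task holding the lock

    def change(self, task: str, code_file: str) -> None:
        """Record a change by a task, taking the lock on first change."""
        owner = self._owner.setdefault(code_file, task)
        if owner != task:
            raise PermissionError(f"{code_file} is locked by {owner}")

    def finish(self, task: str) -> List[str]:
        """Release every lock held by the task and return the released files."""
        released = [f for f, t in self._owner.items() if t == task]
        for f in released:
            del self._owner[f]
        return released


locks = LockTable()
locks.change("Task 1", "V1.sql")      # Task 1 takes the lock on V1.sql
try:
    locks.change("Task 2", "V1.sql")  # a second task cannot change the file
    blocked = False
except PermissionError:
    blocked = True
released = locks.finish("Task 1")     # finishing the task releases the lock
locks.change("Task 2", "V1.sql")      # now Task 2 can change the file
```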

In order to support code and data versioning using shared datasets, the following additional steps are involved:

7. Each View has a write operation and read operation. These operations must support adding a version while producing the output data. If a new version name is given to the write operation, the entire data will be output to a new sub-directory with this version name. If the same version name is given, then the data in that version will be overwritten by the write operation.
8. Production workflows will all use a “no version” value for all read/write operations.
9. When a developer runs a definition from a workspace, if the definition includes changes, the version will receive the Task name that is currently associated with the changes. If the definition is unchanged and represents the latest version of the code in the source control system, then the data version will be referred to as “latest”.
10. The system will maintain, for each developer workspace, a version map that stores which views have been executed for the developer and what was the version value for that view. Any read operation on the view while executing subsequent views from the developer workspace will be instrumented by the version value. In addition, views that are not executed from the workspace will be read without a version and therefore will be read from the production data. When finishing a task and committing the code, the map can be cleared.
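Steps 7 through 10 can be sketched in Python as follows, for illustration only. The class name Workspace, its methods, and the directory layout (production data directly under a per-view directory, versioned data in a sub-directory named after the version) are assumptions made for the sketch; the steps above specify only that a versioned write goes to a sub-directory with the version name.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


class Workspace:
    """Hypothetical developer workspace holding the version map of Step 10."""

    def __init__(self, data_root: str, active_task: Optional[str] = None) -> None:
        self.data_root = PurePosixPath(data_root)
        self.active_task = active_task
        self.version_map: Dict[str, str] = {}  # view -> version last written

    def _path(self, view: str, version: Optional[str]) -> PurePosixPath:
        base = self.data_root / view  # assumed location of production data
        return base if version is None else base / version

    def write_path(self, view: str, changed: bool) -> PurePosixPath:
        """Step 9: changed definitions write under the Task name, others under 'latest'."""
        if changed and self.active_task is None:
            raise RuntimeError("a code change requires an active task")
        version = self.active_task if changed else "latest"
        self.version_map[view] = version
        return self._path(view, version)

    def read_path(self, view: str) -> PurePosixPath:
        """Step 10: read the recorded version, else the 'no version' production data."""
        return self._path(view, self.version_map.get(view))

    def finish_task(self) -> None:
        """Clear the version map when the task is finished and the code committed."""
        self.version_map.clear()
        self.active_task = None


ws = Workspace("/data", active_task="Task1")
before = ws.read_path("V1")                 # V1 not executed here: production data
written = ws.write_path("V1", changed=True)
after = ws.read_path("V1")                  # now resolved through the version map
unchanged = ws.write_path("V3", changed=False)
ws.finish_task()
cleared = ws.read_path("V1")                # map cleared: production data again
```

The code of a view that reads from V1 never names a version; the workspace resolves it, which is why the same definition can run unchanged in production and in development.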

In order to accomplish such code and data versioning, the Start Task, Finish Task, and Update Workspace operations are updated as follows. Once a user starts a new task, any change to the code will invoke a lock on the file with the changes and prevent other users from changing this code file, and locked changed code will also change the version as described in Step 7 while a task is active. Once a user finishes a task, all code is committed into the shared version and all locks are released, and then the version for changed definitions will switch back to “latest.” The Update Workspace operation will update the developer workspace files from the shared version.

By way of example, imagine that there is a view V1 containing the production code and data. A production process in view V2 includes code that contains an instruction to read from view V1. When the production process in view V2 is executed with no version specified, the system automatically reads from the “no version” production data in view V1, as shown schematically in FIG. 42.

Imagine instead that view V2 is associated with a developer who has madecode changes in view V2 but no changes to code or data associated withview V1. The code in view V2 again contains an instruction to read fromview V1. A temporary data set is created in view V1 for use by thedeveloper in view V2 (i.e., so that any production changes that occur donot affect the developer's work, and any changes made by the developerdo not affect the production data). When the development process in viewV2 is executed with no version specified, the system automatically readsfrom the temporary dataset rather than from the production dataset, eventhough the code was not modified to read from the temporary dataset(i.e., the system automatically correlates the temporary dataset withthe development view). When the developer checks in the code changesfrom view V2, the temporary dataset in view V1 is renamed to be the“latest,” as shown schematically in FIG. 43. This “latest” dataset iskept separate from the production data because, at the time of check-in,the latest dataset has not yet replaced the production data. At somepoint, the latest version may be re-designated as the productionversion, for example, by a project manager.

Imagine instead that view V2 is associated with a developer assigned to “Task 1” who has made code changes in view V2 and also in view V1 (e.g., to add a new column to, or change a column definition in, a table in view V1). In this case, a lock is placed on the code file associated with view V1 to prevent other developers from making conflicting changes, and a temporary dataset is created in view V1 for use by Task 1. The temporary dataset is named Task 1 so that it is correlated with the Task 1 development view, i.e., view V2. When the development process in view V2 is executed with no version specified, the system automatically reads from the Task 1 temporary dataset rather than from the production dataset, as shown schematically in FIG. 44.

One major advantage of this code and data versioning methodology is that developers do not have to modify code to read from a specific dataset. Rather, the system automatically determines the dataset to use for a given development task from among a production dataset, a latest dataset, or a temporary dataset associated with the development task.
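
One way to picture this automatic selection is a per-workspace version map that is consulted on every read. The following is a minimal sketch only; the `Workspace` class and its method names are hypothetical, and the production “no version” value is modeled here as `None`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

PRODUCTION = None   # the "no version" value used for production reads and writes
LATEST = "latest"   # unchanged code that matches the shared source-control version


@dataclass
class Workspace:
    """Hypothetical model of one developer workspace."""
    task: Optional[str] = None                       # active task name, if any
    changed: Set[str] = field(default_factory=set)   # views with uncommitted code changes
    version_map: Dict[str, str] = field(default_factory=dict)

    def run(self, view: str) -> str:
        """Execute a view from the workspace and record the data version it wrote."""
        version = self.task if (view in self.changed and self.task) else LATEST
        self.version_map[view] = version
        return version

    def read_version(self, view: str) -> Optional[str]:
        """Version to read from: the recorded one if the view was executed from
        this workspace, otherwise the production data."""
        return self.version_map.get(view, PRODUCTION)

    def finish_task(self) -> None:
        """Commit: the version map can be cleared once the task is finished."""
        self.changed.clear()
        self.version_map.clear()
        self.task = None


ws = Workspace(task="Task 1", changed={"V1"})
ws.run("V1")                     # changed definition: data is written under the task name
print(ws.read_version("V1"))     # later reads of V1 use the task dataset
print(ws.read_version("V0"))     # a view never run here falls back to production
```

The code in a reading view never names a dataset; only the map lookup decides which one is used.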

In certain exemplary embodiments, code and data versioning, as well as other data storage operations, utilize an append-only file system such as the Hadoop Distributed File System (HDFS). In order to support a relational database, insert, update and delete record operations must be supported over a table of records. Typically, this requires the database to update a file stored within the underlying file system on which the database is persisted. However, append-only file systems, such as HDFS, generally do not provide update operations, i.e., one cannot update an existing file in the file system. For append-only file systems, the only change operations allowed are create file, append to file, delete file, create directory, and delete directory. For this reason, databases (e.g., Apache Hive) that are implemented over the append-only file system generally do not support update operations.

Another problem that typically occurs is the need to perform a recovery or a rollback when an error is found after a change to a table. Database systems typically support transactions to abort changes performed to tables (e.g., akin to an “undo” function). However, such transactions generally have a short lifespan (e.g., must be performed soon after the unwanted change is made) and are not adequate for large changes on multiple tables. Such transactions are generally also built on file system update support that allows the database system to update “dirty” records and clear them when the records are committed, which is generally not possible with append-only file systems.

Therefore, certain exemplary embodiments provide support for update and delete operations in an append-only file system. Specifically, a folder is created for each view, and a new subfolder is created for each update to the view. Each subfolder is associated with a timestamp indicating the time the update was made to the view. The following is an example of a folder structure for a view named “View 1”:

Directory “View 1”

-   -   Directory “timestamp1”
        -   File “datafile1”
        -   File “datafile2”
    -   Directory “timestamp2”
        -   File “datafile4”
        -   File “datafile7”
    -   Directory “timestamp3”
        -   File “datafile6”
        -   File “datafile9”

For every execution of the view, a new timestamp is generated, and new data is inserted into files in the new directory. This structure on its own supports appending new records as they go to new files but still does not support update or delete operations.
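
The append-only write pattern just described can be sketched as follows, using only directory and file creation. The function name, the zero-padded directory naming, and the JSON-lines file format are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_execution(view_dir: Path, timestamp: int, records: list) -> Path:
    """Persist one execution of a view as a brand-new timestamp directory."""
    delta_dir = view_dir / f"{timestamp:020d}"      # zero-padded so names sort by time
    delta_dir.mkdir(parents=True, exist_ok=False)   # existing directories are never reused
    with open(delta_dir / "datafile", "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    return delta_dir


view_dir = Path(tempfile.mkdtemp()) / "View 1"
append_execution(view_dir, 1, [{"customerID": 1, "Active": True}])
append_execution(view_dir, 2, [{"customerID": 2, "Active": True}])
print(sorted(p.name for p in view_dir.iterdir()))
```

Only create operations are used, so the sketch stays within the set of operations an append-only file system allows.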

Therefore, in certain exemplary embodiments, update and delete operations always work on a primary unique key that can identify the unique logical record being changed. For example, a table of customers may have a unique customer identifier (customerID) representing the customer entity. When an update operation needs to change a certain field in a record, it provides the key for the operation, the field to change, and the new value for the field. The following is an example of such an update operation:

Update Customer where customerID=1, Set Active=False

This operation changes the value of the field named ‘Active’ to “False” for the record that belongs to the Customer with ID 1.

Due to the lack of update operations in the append-only file system, exemplary embodiments use the append operation for saving the update and add a new record representing the updated state of the record. For example, assume that the value of the field named ‘Active’ is initially set to “True” at Time 1 and is later changed to “False” at Time 3 for the record that belongs to the Customer with ID 1. A data file containing the record with Active=True would be added to a subfolder associated with the timestamp for Time 1 and a separate data file containing the record with Active=False would be added to a subfolder associated with the timestamp for Time 3. The following is an example of the folder structure containing the updated record:

Directory “View 1”

-   -   Directory “timestamp1”
        -   File “datafile1”
        -   File “datafile2” (Customer ID=1, active=True: first appearance of this record)
    -   Directory “timestamp2”
        -   File “datafile4”
        -   File “datafile7”
    -   Directory “timestamp3”
        -   File “datafile6”
        -   File “datafile9” (Customer ID=1, active=False: second appearance of this record)

A read operation that reads only the last timestamp will only have the updated records of the last update, which represents only partial data. On the other hand, reading all of the files from all timestamps will return duplicate records with inconsistent data for the same logical key. In order to solve this, exemplary embodiments read all data from all timestamps and then de-duplicate the records by taking the “latest” record associated with each logical key. This operation can be done efficiently if the data is partitioned by logical key, and the de-duplication can be done in memory with only slight overhead over normal read operations.
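
The update-by-append and de-duplicating read can be sketched as follows. The folder structure is replaced here by a hypothetical in-memory mapping from timestamp to records, and the sketch assumes that each appended record is a full copy of the logical record; neither choice is stated in the description above.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the folder structure:
# timestamp -> records written under that timestamp directory.
Deltas = Dict[int, List[dict]]


def read_latest(deltas: Deltas, key: str = "customerID") -> Dict[object, dict]:
    """Read every timestamp and keep only the most recent record per logical key."""
    latest: Dict[object, dict] = {}
    for timestamp in sorted(deltas):        # oldest first
        for record in deltas[timestamp]:
            latest[record[key]] = record    # newer records replace older ones
    return latest


def update(deltas: Deltas, timestamp: int, key_value, field: str, value,
           key: str = "customerID") -> None:
    """Express an update as an append: write a new copy of the record with one
    field changed under a new timestamp, leaving older data untouched."""
    current = read_latest(deltas, key)[key_value]
    deltas.setdefault(timestamp, []).append({**current, field: value})


view1: Deltas = {
    1: [{"customerID": 1, "Active": True}, {"customerID": 2, "Active": True}],
}
# Update Customer where customerID=1, Set Active=False (performed at Time 3)
update(view1, 3, key_value=1, field="Active", value=False)
print(read_latest(view1)[1])
```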

In order to support a delete operation, a ‘delete record’ is appended into the data in a manner similar to the update operation, and the read operation method is modified. A delete record is a record with the same key as the record to be deleted but with a special field called ‘delete’ that is set to true. The read operation now reads all data from all timestamps, de-duplicates the records by taking the “latest” record associated with each logical key, and then filters the resulting records based on the ‘delete’ field to keep or utilize only records with ‘delete’=false. The following is an example of the folder structure containing the deleted record:

Directory “View 1”

-   -   Directory “timestamp1”
        -   File “datafile1”
        -   File “datafile2” (Customer ID=1, active=True: first appearance of this record)
    -   Directory “timestamp2”
        -   File “datafile4”
        -   File “datafile7”
    -   Directory “timestamp3”
        -   File “datafile6”
        -   File “datafile9” (Customer ID=1, active=False: second appearance of this record)
    -   Directory “timestamp4”
        -   File “datafile10”
        -   File “datafile11” (Customer ID=1, delete=True: third appearance of this record)
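
The delete record and the modified read operation can be sketched in the same style. As before, the mapping from timestamp to records is a hypothetical in-memory stand-in for the timestamp directories, and the function names are illustrative.

```python
from typing import Dict, List

Deltas = Dict[int, List[dict]]  # hypothetical: timestamp -> appended records


def delete(deltas: Deltas, timestamp: int, key_value, key: str = "customerID") -> None:
    """Append a delete record: same key, with the special 'delete' field set."""
    deltas.setdefault(timestamp, []).append({key: key_value, "delete": True})


def read_live(deltas: Deltas, key: str = "customerID") -> Dict[object, dict]:
    """Read all timestamps, keep the latest record per key, then drop deleted keys."""
    latest: Dict[object, dict] = {}
    for timestamp in sorted(deltas):
        for record in deltas[timestamp]:
            latest[record[key]] = record
    return {k: r for k, r in latest.items() if not r.get("delete", False)}


view1: Deltas = {
    1: [{"customerID": 1, "Active": True}, {"customerID": 2, "Active": True}],
    3: [{"customerID": 1, "Active": False}],
}
delete(view1, 4, key_value=1)
print(read_live(view1))
```

Filtering happens after de-duplication, so a delete record hides every earlier appearance of the same key without any existing file being modified.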

It should be noted that a periodic “compression” can be run over the folder in order to replace old updates and deleted records with a new snapshot that contains all logical keys in a single timestamp directory. One method to achieve this is to read and process the entire table as discussed above, write the resulting records back into a new timestamp directory, and delete all previous directories. The following is an example of the folder structure following compression:

Directory “View 1”

-   -   Directory “timestamp5”
        -   File “datafile12”
        -   File “datafile13” (excludes record with Customer ID=1)
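
The compression step can be sketched as a function that reads the whole table, keeps the live records, and returns a structure holding a single new timestamp. The in-memory mapping and the `compact` name are hypothetical.

```python
from typing import Dict, List

Deltas = Dict[int, List[dict]]  # hypothetical: timestamp -> appended records


def compact(deltas: Deltas, new_timestamp: int, key: str = "customerID") -> Deltas:
    """Collapse all timestamps into one snapshot that holds only live records."""
    if deltas and new_timestamp <= max(deltas):
        raise ValueError("the snapshot timestamp must be newer than all existing ones")
    latest: Dict[object, dict] = {}
    for timestamp in sorted(deltas):
        for record in deltas[timestamp]:
            latest[record[key]] = record
    live = [r for r in latest.values() if not r.get("delete", False)]
    return {new_timestamp: live}   # all previous timestamps are dropped


view1: Deltas = {
    1: [{"customerID": 1, "Active": True}, {"customerID": 2, "Active": True}],
    3: [{"customerID": 1, "Active": False}],
    4: [{"customerID": 1, "delete": True}],
}
print(compact(view1, new_timestamp=5))
```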

From time to time, it might be necessary or desirable to roll back a workflow to an earlier view. Consider a workflow graph in which view V2 reads from view V1 and view V3 reads from view V2, for example, as depicted in FIG. 41. When a new input is processed through the workflow, each of the views V1, V2, V3 might have updates that it persists. In order to roll back the changes, the changes persisted by each of these views must be undone in a consistent way. In order to accomplish this, the workflow engine determines the ‘timestamp’ that each of these views will use to persist the changes. Since each delta is associated with a specific unique timestamp that is common to all the views that are in the workflow, for example, as shown in FIG. 45, a rollback operation is now a simple matter of deleting the folders of all the views created after the designated timestamp. Since each delta only appends data and does not update any of the other data, the changes are reverted simply by the deletion of these folders, and any subsequent read will read the previous version. It should be noted that rollback information may be lost when a compression operation is performed.
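
A rollback of this kind might be sketched as follows. The sketch assumes that timestamp directories are named with plain integers and that the same timestamps are used across all views in the workflow; the `rollback` function and its parameters are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def rollback(view_dirs: Iterable[Path], keep_through: int) -> List[str]:
    """Delete every timestamp directory newer than `keep_through` in all views."""
    removed = []
    for view_dir in view_dirs:
        for delta_dir in sorted(view_dir.iterdir()):
            if delta_dir.is_dir() and int(delta_dir.name) > keep_through:
                shutil.rmtree(delta_dir)
                removed.append(f"{view_dir.name}/{delta_dir.name}")
    return removed


root = Path(tempfile.mkdtemp())
views = [root / name for name in ("V1", "V2", "V3")]
for view_dir in views:
    for timestamp in (1, 2, 3):   # the workflow engine assigns the same timestamps to all views
        (view_dir / str(timestamp)).mkdir(parents=True)

print(rollback(views, keep_through=2))
```

Because the same timestamp is shared by every view in the workflow, a single cutoff value is enough to bring all of the views back to a mutually consistent state.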

It should be noted that these described methods for updating and deleting versions in an append-only file system can be implemented in any data flow engine or database query processor that implements read/write operations on relational data and uses an append-only file system.

Various embodiments of the present invention may be characterized by the potential claims listed in the paragraphs following this paragraph (and before the actual claims provided at the end of this application). These potential claims form a part of the written description of this application. Accordingly, subject matter of the following potential claims may be presented as actual claims in later proceedings involving this application or any application claiming priority based on this application. Inclusion of such potential claims should not be construed to mean that the actual claims do not cover the subject matter of the potential claims. Thus, a decision to not present these potential claims in later proceedings should not be construed as a donation of the subject matter to the public.

Without limitation, potential subject matter that may be claimed (prefaced with the letter “P” so as to avoid confusion with the actual claims presented below) includes:

P1. A computer-implemented method for interactive database query reporting, the method comprising:

causing display, by a server, on a display screen of a computer of a user, of a first graphical user interface screen including an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one;

receiving, by the server, a first user input associated with a given row i of the interactive query code table; and

displaying, by the server, a first data table including results from execution of queries 1 through i, such that the server enables the user to select any given database query in the interactive query code table to view results through the given database query.

P2. The method of claim P1, further comprising:

receiving, by the server, a second user input making a change to the database query associated with the given row i;

executing, by the server, at least the changed database query associated with the given row i; and

displaying, by the server, a second data table including the results from execution of the changed database query i.

P3. The method of claim P2, further comprising:

invalidating, by the server, prior results from the database queries associated with rows i through n prior to executing at least the changed database query associated with the given row and displaying the second data table.

P4. The method of claim P2, wherein executing at least the changed database query associated with the given row comprises:

executing, by the server, the queries associated with rows 1 through i including the changed database query associated with the given row i.

P5. The method of claim P2, further comprising:

placing a temporary lock on the n rows of the interactive query code table to prevent the user from making inadvertent changes; and

allowing the user to override the temporary lock to provide the second user input.

P6. The method of claim P5, further comprising:

allowing the user to enter additional queries into additional rows of the interactive query code table but not change existing queries when the temporary lock is in place.

P7. The method of claim P1, wherein:

the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query; and

the first user input includes a selection of a given cell in the given row.

P8. The method of claim P2, wherein:

the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query;

the first user input includes a selection of a given cell in the given row; and

the second user input includes a change to the contents of the given cell to add a new query operator to the given cell or change a prior query operator in the given cell associated with query i.

P9. A system for interactive database query reporting, the system comprising:

a computer system having stored thereon and executing interactive database query reporting computer processes comprising:

causing display, on a display screen of a computer of a user, of a first graphical user interface screen including an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one;

receiving a first user input associated with a given row i of the interactive query code table; and

displaying a first data table including results from execution of queries 1 through i, such that the server enables the user to select any given database query in the interactive query code table to view results through the given database query.

P10. The system of claim P9, wherein the interactive database query reporting computer processes further comprise:

receiving a second user input making a change to the database query associated with the given row i;

executing at least the changed database query associated with the given row i; and

displaying a second data table including the results from execution of the changed database query i.

P11. The system of claim P10, wherein the interactive database query reporting computer processes further comprise:

invalidating prior results from the database queries associated with rows i through n prior to executing at least the changed database query associated with the given row and displaying the second data table.

P12. The system of claim P10, wherein executing at least the changed database query associated with the given row comprises:

executing the queries associated with rows 1 through i including the changed database query associated with the given row i.

P13. The system of claim P10, wherein the interactive database query reporting computer processes further comprise:

placing a temporary lock on the n rows of the interactive query code table to prevent the user from making inadvertent changes; and

allowing the user to override the temporary lock to provide the second user input.

P14. The system of claim P13, wherein the interactive database query reporting computer processes further comprise:

allowing the user to enter additional queries into additional rows of the interactive query code table but not change existing queries when the temporary lock is in place.

P15. The system of claim P9, wherein:

the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query; and

the first user input includes a selection of a given cell in the given row.

P16. The system of claim P10, wherein:

the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query;

the first user input includes a selection of a given cell in the given row; and

the second user input includes a change to the contents of the given cell to add a new query operator to the given cell or change a prior query operator in the given cell associated with query i.

P17. A computer program product comprising a tangible, non-transitory computer readable medium having stored thereon a computer program for interactive database query reporting, which, when run on a computer system, causes the computer system to execute interactive database query reporting computer processes comprising:

causing display, on a display screen of a computer of a user, of a first graphical user interface screen including an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one;

receiving a first user input associated with a given row i of the interactive query code table; and

displaying a first data table including results from execution of queries 1 through i, such that the server enables the user to select any given database query in the interactive query code table to view results through the given database query.

P18. The computer program product of claim P17, wherein the interactive database query reporting computer processes further comprise:

receiving a second user input making a change to the database query associated with the given row i;

executing at least the changed database query associated with the given row i; and

displaying a second data table including the results from execution of the changed database query i.

P19. The computer program product of claim P18, wherein the interactive database query reporting computer processes further comprise:

invalidating prior results from the database queries associated with rows i through n prior to executing at least the changed database query associated with the given row and displaying the second data table.

P20. The computer program product of claim P18, wherein executing at least the changed database query associated with the given row comprises:

executing the queries associated with rows 1 through i including the changed database query associated with the given row i.

P21. The computer program product of claim P18, wherein the interactive database query reporting computer processes further comprise:

placing a temporary lock on the n rows of the interactive query code table to prevent the user from making inadvertent changes; and

allowing the user to override the temporary lock to provide the second user input.

P22. The computer program product of claim P21, wherein the interactive database query reporting computer processes further comprise:

allowing the user to enter additional queries into additional rows of the interactive query code table but not change existing queries when the temporary lock is in place.

P23. The computer program product of claim P17, wherein:

the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query; and

the first user input includes a selection of a given cell in the given row.

P24. The computer program product of claim P18, wherein:

the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query;

the first user input includes a selection of a given cell in the given row; and

the second user input includes a change to the contents of the given cell to add a new query operator to the given cell or change a prior query operator in the given cell associated with query i.

P25. A computer-implemented method for converting scalar expressions to relational database queries applied to a dataset, the method comprising:

producing a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions;

assigning each scalar expression to one of a plurality of successive execution stages identifiable as stages 1 through n based on the functional dependency graph such that each execution stage includes at least one of the scalar expressions, wherein the at least one scalar expression associated with any given stage does not require results from a subsequent stage of the plurality of successive stages; and

converting the scalar expressions at each stage into one or more relational database queries to create a sequence of relational database queries, wherein each execution stage involves at most one pass through the dataset.

P26. The method of claim P25, wherein at least one stage involves generation of a temporary data set used in at least one subsequent stage.

P27. The method of claim P25, wherein each stage includes the maximum number of possible expressions that can be executed at that stage.

P28. The method of claim P25, wherein producing the functional dependency graph comprises, for each scalar expression:

identifying an output parameter and a set of input parameters associated with the scalar expression;

creating a node for the output parameter in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph;

creating, for each input parameter in the set of input parameters, a distinct node in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph; and

establishing, in the functional dependency graph, an association between the output parameter node and each of distinct input parameter nodes.

P29. The method of claim P28, wherein the association includes an operator node when the output parameter node is dependent on a combination of two or more distinct input parameter nodes, the operator node representing an operator to be performed on the two or more distinct input parameters.

P30. The method of claim P25, where assigning each scalar expression to one of the plurality of successive stages based on the functional dependency graph comprises:

traversing the functional dependency graph using a breadth first traversal configured to exclude, from each given execution stage, any aggregate expression at a lower-level of the functional dependency graph;

assigning each expression to a given stage based on the breadth first traversal.

P31. The method of claim P25, wherein converting the nodes at each stage into one or more relational database queries to create a sequence of relational database queries comprises:

generating a sequence of structured query language commands.

P32. The method of claim P25, wherein assigning each scalar expression to one of the plurality of successive execution stages based on the functional dependency graph ensures the minimum number of passes through the dataset for execution of the scalar expressions.

P33. A system for converting scalar expressions to relational database queries for application to a dataset, the system comprising:

a computer system having stored thereon and executing computer processes comprising:

producing a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions;

assigning each scalar expression to one of a plurality of successive execution stages identifiable as stages 1 through n based on the functional dependency graph such that each execution stage includes at least one of the scalar expressions, wherein the at least one scalar expression associated with any given stage does not require results from a subsequent stage of the plurality of successive stages; and

converting the scalar expressions at each stage into one or more relational database queries to create a sequence of relational database queries, wherein each execution stage involves at most one pass through the dataset.

P34. The system of claim P33, wherein at least one stage involves generation of a temporary data set used in at least one subsequent stage.

P35. The system of claim P33, wherein each stage includes the maximum number of possible expressions that can be executed at that stage.

P36. The system of claim P33, wherein producing the functional dependency graph comprises, for each scalar expression:

identifying an output parameter and a set of input parameters associated with the scalar expression;

creating a node for the output parameter in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph;

creating, for each input parameter in the set of input parameters, a distinct node in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph; and

establishing, in the functional dependency graph, an association between the output parameter node and each of distinct input parameter nodes.

P37. The system of claim P36, wherein the association includes an operator node when the output parameter node is dependent on a combination of two or more distinct input parameter nodes, the operator node representing an operator to be performed on the two or more distinct input parameters.

P38. The system of claim P33, where assigning each scalar expression to one of the plurality of successive stages based on the functional dependency graph comprises:

traversing the functional dependency graph using a breadth first traversal configured to exclude, from each given execution stage, any aggregate expression at a lower-level of the functional dependency graph;

assigning each expression to a given stage based on the breadth first traversal.

P39. The system of claim P33, wherein converting the nodes at each stage into one or more relational database queries to create a sequence of relational database queries comprises:

generating a sequence of structured query language commands.

P40. The system of claim P33, wherein assigning each scalar expression to one of the plurality of successive execution stages based on the functional dependency graph ensures the minimum number of passes through the dataset for execution of the scalar expressions.

P41. A computer program product comprising a tangible, non-transitory computer readable medium having stored thereon a computer program for converting scalar expressions to relational database queries for application to a dataset, which, when run on a computer system, causes the computer system to execute interactive database query reporting computer processes comprising:

producing a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions;

assigning each scalar expression to one of a plurality of successive execution stages identifiable as stages 1 through n based on the functional dependency graph such that each execution stage includes at least one of the scalar expressions, wherein the at least one scalar expression associated with any given stage does not require results from a subsequent stage of the plurality of successive stages; and

converting the scalar expressions at each stage into one or more relational database queries to create a sequence of relational database queries, wherein each execution stage involves at most one pass through the dataset.

P42. The computer program product of claim P41, wherein at least one stage involves generation of a temporary data set used in at least one subsequent stage.

P43. The computer program product of claim P41, wherein each stage includes the maximum number of possible expressions that can be executed at that stage.

P44. The computer program product of claim P41, wherein producing the functional dependency graph comprises, for each scalar expression:

identifying an output parameter and a set of input parameters associated with the scalar expression;

creating a node for the output parameter in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph;

creating, for each input parameter in the set of input parameters, a distinct node in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph; and

establishing, in the functional dependency graph, an association between the output parameter node and each of distinct input parameter nodes.

P45. The computer program product of claim P44, wherein the association includes an operator node when the output parameter node is dependent on a combination of two or more distinct input parameter nodes, the operator node representing an operator to be performed on the two or more distinct input parameters.

P46. The computer program product of claim P41, wherein assigning each scalar expression to one of the plurality of successive stages based on the functional dependency graph comprises:

traversing the functional dependency graph using a breadth first traversal configured to exclude, from each given execution stage, any aggregate expression at a lower level of the functional dependency graph; and

assigning each expression to a given stage based on the breadth first traversal.
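
As a hedged illustration of stage assignment, the following Python sketch layers expressions breadth first so that no expression needs a result from a later stage, and places each expression in the earliest stage at which all of its inputs are available. The function name assign_stages and the dictionary format for expressions are assumptions of this sketch. Special handling of aggregate expressions is omitted, so this is a simplified stand-in rather than the traversal recited in claim P46.

```python
from collections import deque


def assign_stages(expressions, source_columns):
    """Return a dict mapping each output name to a 1-based stage number.

    expressions: output name -> list of input names (hypothetical format).
    source_columns: names already present in the dataset.
    """
    sources = set(source_columns)
    derived_inputs = {}  # output -> inputs that are themselves computed
    consumers = {}       # computed name -> expressions that read it
    for output, inputs in expressions.items():
        derived = {name for name in inputs if name in expressions}
        unknown = set(inputs) - derived - sources
        if unknown:
            raise ValueError(f"{output} reads undefined inputs: {sorted(unknown)}")
        derived_inputs[output] = derived
        for name in derived:
            consumers.setdefault(name, []).append(output)

    # Breadth-first layering: start from expressions that read only source data.
    remaining = {output: len(deps) for output, deps in derived_inputs.items()}
    queue = deque(output for output, count in remaining.items() if count == 0)
    stage = {}
    while queue:
        output = queue.popleft()
        # One stage after the latest stage among the computed inputs.
        stage[output] = 1 + max((stage[d] for d in derived_inputs[output]), default=0)
        for consumer in consumers.get(output, []):
            remaining[consumer] -= 1
            if remaining[consumer] == 0:
                queue.append(consumer)

    if len(stage) != len(expressions):
        raise ValueError("cyclic dependency between expressions")
    return stage


stages = assign_stages(
    {
        "margin": ["revenue", "cost"],
        "tax": ["revenue"],
        "margin_pct": ["margin", "revenue"],
        "net": ["margin", "tax"],
    },
    source_columns=["revenue", "cost"],
)
```

Here "margin" and "tax" fall into stage 1 and "margin_pct" and "net" into stage 2, because the latter two read results produced in stage 1.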

P47. The computer program product of claim P41, wherein converting the nodes at each stage into one or more relational database queries to create a sequence of relational database queries comprises:

generating a sequence of structured query language commands.
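
One possible way to picture such a sequence of structured query language commands is sketched below using Python's standard sqlite3 module. The function name stage_queries, the stage_N table naming, and the sample table and columns are assumptions of this sketch. Each stage is emitted as a single statement that reads the previous stage's output once and materializes a temporary table for the next stage. Identifiers and expressions are interpolated directly into the SQL text, so the sketch presumes trusted input.

```python
import sqlite3


def stage_queries(staged_exprs, source_table):
    """Build one CREATE TABLE ... AS SELECT statement per stage.

    staged_exprs: list with one dict per stage, in order, mapping an output
    column name to a SQL scalar expression (hypothetical format).
    """
    queries = []
    previous = source_table
    for number, exprs in enumerate(staged_exprs, start=1):
        target = f"stage_{number}"
        columns = ", ".join(f"{sql} AS {name}" for name, sql in exprs.items())
        # One pass over the previous stage's rows per statement.
        queries.append(
            f"CREATE TEMP TABLE {target} AS SELECT *, {columns} FROM {previous}"
        )
        previous = target
    return queries


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (revenue REAL, cost REAL)")
conn.execute("INSERT INTO sales VALUES (100.0, 60.0)")

staged = [
    {"margin": "revenue - cost"},        # stage 1 reads source columns only
    {"margin_pct": "margin / revenue"},  # stage 2 reads a stage 1 result
]
for query in stage_queries(staged, "sales"):
    conn.execute(query)

row = conn.execute("SELECT margin, margin_pct FROM stage_2").fetchone()
```

The temporary table produced by stage 1 plays the role of the temporary data set of claim P42 in this sketch, since stage 2 reads "margin" from it.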

P48. The computer program product of claim P41, wherein assigning each scalar expression to one of the plurality of successive execution stages based on the functional dependency graph ensures the minimum number of passes through the dataset for execution of the scalar expressions.

P49. A system for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis comprising:

a computer system having stored thereon and executing computer processes for implementing a signal hub, the computer processes comprising:

a signal hub engine configured to generate and monitor a set of named signals from a plurality of data sources to provide a reusable signal layer of maintained and refreshed named signals on top of the source data for consumption by analytic code applications; and

a graphical user interface configured to allow users to define signal categories and relationships used by the signal hub engine to generate the set of named signals, explore lineage and dependencies of the named signals in the signal layer, monitor and manage the signal layer including recovery from issues identified by monitoring of the named signals by the signal hub engine, and create and execute analytic code applications that utilize the named signals.

P50. The system of claim P49, wherein the set of named signals includes descriptive signals and predictive signals.

P51. The system of claim P49, wherein the signal hub engine is configured to generate the set of named signals based on combinations of signal categories including entity, transformation, attribute, and time frame.

P52. The system of claim P51, wherein the signal hub engine is configured to associate each named signal with a name that is automatically generated for the signal based on the source data used to generate the named signal.

P53. The system of claim P49, wherein the signal hub engine is further configured to store, for each named signal, metadata providing lineage information for the named signal, and to provide the metadata for consumption by analytic code applications.

P54. The system of claim P49, wherein the graphical user interface is configured to categorize a plurality of named signals based on taxonomies and allow users to search for named signals based on the taxonomies.

P55. The system of claim P49, wherein the signal hub engine is configured to automatically detect changes from the data sources and update the set of named signals based on relevant data changes without transactional system support.

P56. The system of claim P49, wherein the signal hub engine is configured to enable a named signal to be created from at least one other previously created named signal.

It should be understood by persons of ordinary skill in the art that the term “computer process” as used herein is the performance of a described function in a computer using computer hardware (such as a processor, field-programmable gate array or other electronic combinatorial logic, or similar device), which may be operating under control of software or firmware or a combination of any of these or operating outside control of any of the foregoing. All or part of the described function may be performed by active or passive electronic components, such as transistors or resistors. A “computer process” does not necessarily require a schedulable entity, or operation of a computer program or a part thereof, although, in some embodiments, a computer process may be implemented by such a schedulable entity, or operation of a computer program or a part thereof. Furthermore, unless the context otherwise requires, a “process” may be implemented using more than one processor or more than one (single- or multi-processor) computer.

Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make any variations and modifications without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure.

What is claimed is:
1. A computer-implemented method for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files that operate on data files, the method comprising: maintaining a first view having a production version of a dataset; creating a task for a developer, the task being associated with a second view; associating a first code file with the task for the first view, the first code file including code that modifies the dataset; creating a temporary version of the dataset in the first view; associating the temporary version of the dataset with the task; associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task such that the second code file in the second view modifies the temporary dataset.
2. The method of claim 1, further comprising: placing locks on the first code file in the first view and the second code file in the second view.
3. The method of claim 2, wherein placing the locks on the first and second code files comprises checking out the first and second code files from a source control system.
4. The method of claim 1, further comprising: receiving, from the developer, a request to commit changes made to the first and second code files; checking the first and second code files into a source control system; changing the temporary dataset to be a latest dataset; and terminating the task.
5. A system for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files that operate on data files, the system comprising: a computer system having stored thereon and executing computer processes comprising: maintaining a first view having a production version of a dataset; creating a task for a developer, the task being associated with a second view; associating a first code file with the task for the first view, the first code file including code that modifies the dataset; creating a temporary version of the dataset in the first view; associating the temporary version of the dataset with the task; associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task such that the second code file in the second view modifies the temporary dataset.
6. The system of claim 5, wherein the computer processes further comprise: placing locks on the first code file in the first view and the second code file in the second view.
7. The system of claim 6, wherein placing the locks on the first and second code files comprises checking out the first and second code files from a source control system.
8. The system of claim 5, wherein the computer processes further comprise: receiving, from the developer, a request to commit changes made to the first and second code files; checking the first and second code files into a source control system; changing the temporary dataset to be a latest dataset; and terminating the task.
9. A computer program product comprising a tangible, non-transitory computer readable medium having stored thereon a computer program for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files that operate on data files, which, when run on a computer system, causes the computer system to execute computer processes comprising: maintaining a first view having a production version of a dataset; creating a task for a developer, the task being associated with a second view; associating a first code file with the task for the first view, the first code file including code that modifies the dataset; creating a temporary version of the dataset in the first view; associating the temporary version of the dataset with the task; associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task such that the second code file in the second view modifies the temporary dataset.
10. The computer program product of claim 9, wherein the computer processes further comprise: placing locks on the first code file in the first view and the second code file in the second view.
11. The computer program product of claim 10, wherein placing the locks on the first and second code files comprises checking out the first and second code files from a source control system.
12. The computer program product of claim 9, wherein the computer processes further comprise: receiving, from the developer, a request to commit changes made to the first and second code files; checking the first and second code files into a source control system; changing the temporary dataset to be a latest dataset; and terminating the task.
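
For illustration only, and without limiting the claims, the task-based dataset resolution and commit steps recited above might be sketched in Python as follows. The names VersionRegistry, DatasetVersions, resolve, and commit, the string identifiers for dataset versions, and the precedence order (the task's temporary version, then the latest version, then production) are assumptions of this sketch. The disclosure states only that the system selects among production, latest, and temporary datasets. Views, code files, and source control locks are not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetVersions:
    production: str
    latest: Optional[str] = None
    # Task id -> temporary version associated with that task.
    temporary: dict = field(default_factory=dict)


class VersionRegistry:
    def __init__(self):
        self._datasets = {}

    def register(self, name, production):
        self._datasets[name] = DatasetVersions(production=production)

    def create_temporary(self, name, task_id):
        # Associate a temporary version of the dataset with the task.
        version = f"{name}@tmp-{task_id}"
        self._datasets[name].temporary[task_id] = version
        return version

    def resolve(self, name, task_id=None):
        # Code names only the dataset; the version is chosen here.
        versions = self._datasets[name]
        if task_id is None:
            return versions.production       # production runs
        if task_id in versions.temporary:
            return versions.temporary[task_id]  # this task's own changes
        if versions.latest is not None:
            return versions.latest           # committed development data
        return versions.production

    def commit(self, name, task_id):
        # The temporary version becomes the latest version; the task's
        # association with it ends.
        versions = self._datasets[name]
        versions.latest = versions.temporary.pop(task_id)


registry = VersionRegistry()
registry.register("orders", production="orders@prod")
tmp = registry.create_temporary("orders", task_id="T1")
before_commit = registry.resolve("orders", task_id="T1")
other_task_before = registry.resolve("orders", task_id="T2")
registry.commit("orders", task_id="T1")
other_task_after = registry.resolve("orders", task_id="T2")
```

In this sketch, code running under task T1 reads and writes its own temporary version without naming it, another task sees that data only after T1 commits, and a run with no task continues to read the production version.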